ABSTRACT

In this paper, we consider the problem of estimating self-tuning his- 
tograms using query workloads. To this end, we propose a general 
learning theoretic formulation. Specifically, we use query feedback 
from a workload as training data to estimate a histogram with a 
small memory footprint that minimizes the expected error on future 
queries. Our formulation provides a framework in which different 
approaches can be studied and developed. We first study the simple 
class of equi-width histograms and present a learning algorithm, 
EquiHist, that is competitive in many settings. We also provide 
formal guarantees for equi-width histograms that highlight scenar- 
ios in which equi-width histograms can be expected to succeed or 
fail. We then go beyond equi-width histograms and present a novel 
learning algorithm, SpHist, for estimating general histograms. Here 
we use Haar wavelets to reduce the problem of learning histograms 
to that of learning a sparse vector. Both algorithms have multiple 
advantages over existing methods: 1) simple and scalable exten- 
sions to multi-dimensional data, 2) scalability with number of his- 
togram buckets and size of query feedback, 3) natural extensions to 
incorporate new feedback and handle database updates. We demon- 
strate these advantages over the current state-of-the-art, ISOMER, 
through detailed experiments on real and synthetic data. In particu- 
lar, we show that SpHist obtains up to 50% less error than ISOMER 
on real-world multi-dimensional datasets. 

1 Introduction 

Histograms are a central component of modern databases. They are 
used to summarize data and estimate cardinalities of sub-expressions 
during query optimization. Typically, histograms are constructed 
solely from data and are not workload-aware. This approach has 
known limitations [2, 21]: First, a workload may access data
non-uniformly, so a workload-oblivious histogram might waste
resources (e.g., space) on infrequently accessed parts of the data. 
Second, constructing a histogram from data can be expensive, re- 
quiring a scan or a sample of data; further, maintaining a histogram 
in the presence of updates is nontrivial. The standard approach is to 
rebuild histograms from scratch after a certain number of updates, 
resulting in possibly inaccurate histograms between builds [11].

To address these limitations, prior work has proposed self-tuning
histograms [1, 2, 21]. Briefly, the idea is to collect query feedback
information during query execution and to use this information to
build and refine histograms. Query feedback is typically cardinalities
of filter expressions over tables (e.g., $|\sigma_{5 \le A \le 10 \wedge 2 \le B \le 3}(R)| = 10$);
following prior work [21], we refer to such cardinalities, together
with the corresponding query expressions, as query feedback records
(QFRs). Such feedback can be collected with minimal overhead
during query execution by maintaining counters at operators
in the query plan [1, 15]. The overall idea is that since query
feedback captures data characteristics relevant to a workload, his- 
tograms built using query feedback would be more accurate for 
similar workloads. Also, histograms can be refined as new feed- 
back information is available and hence one can track changes in 
data characteristics arising from updates. 

Over the years, several self-tuning histograms have been introduced,
such as STGrid [1], STHoles [2], and ISOMER [21]. Each
of these methods uses an interesting way of selecting histogram
bucket boundaries as well as fixing histogram bucket heights. However,
most of the existing methods lack theoretical analyses or guarantees
and do not scale well to high dimensions or to a large number of
QFRs. Of particular interest is ISOMER [21], the current state of
the art in self-tuning histograms. ISOMER uses query feedback
to compute a "consistent" and unbiased histogram based on
the maximum-entropy (maxent) principle. However, the obtained
histogram might have $\Omega(N)$ buckets, given $N$ QFRs. To get a
histogram with $k \ll N$ buckets, ISOMER heuristically eliminates up
to $(N - k)$ feedback records. This step discards valuable feedback
information and can have an adverse impact on quality, as
empirically demonstrated in Section 4. Furthermore, it hinders the
method's scalability to high dimensions or a large number of QFRs.
Another limitation of this approach is that it is not robust to database 
updates. Updates can produce inconsistent query feedback for which 
the maxent distribution is undefined. Again, ISOMER heuristically 
discards potentially useful QFRs to get a consistent subset. 

In this paper, we propose and study a simple learning-theoretic 
formalization of self-tuning histograms. Informally, we model the 
QFRs as training examples drawn from some unknown distribution 
and the goal is to learn a $k$-bucket histogram that minimizes the
expected cardinality estimation error on future queries. Our for- 
malization is based on standard learning principles and confers sev- 
eral advantages: (1) Our learning algorithms leverage all available 
feedback information (unlike ISOMER) and in many scenarios this 
additional information translates to dramatic (order-of-magnitude) 
reductions in cardinality estimation errors (see Section 4). (2) Our
framework lends itself to efficient algorithms that are scalable to
multiple dimensions and large numbers of QFRs. (3) Our formalization
is database-update friendly: it is inherently robust to inconsistent
QFRs and can easily incorporate natural strategies such as
using higher weights for recent QFRs compared to older ones. 

We next list our main algorithmic contributions: 



1. Equi-width histograms: We begin by studying 1-dimensional
equi-width histograms in our learning framework and provide an
efficient algorithm (EquiHist) for the same. When the number of
buckets is reasonably large (relative to how spiky the true distribution
is), this approach performs well in practice. We also present a
theoretical analysis that shows that the error incurred by the learned 
equi-width histogram is arbitrarily close to that of the best overall 
histogram under reasonable assumptions. This result is of indepen- 
dent theoretical interest; we know of no prior result that analyzes 
cases when equi-width histograms can be expected to succeed/fail. 

2. Sparse-vector recovery based method: One of the main contributions
of this paper is a novel reduction from the general histogram
learning problem to a sparse-vector recovery problem. Informally,
using Haar wavelets, we represent a histogram as a sparse
vector. We then cast our learning problem as that of learning a
sparse vector. To this end, we provide an efficient algorithm (SpHist)
by adapting techniques from compressed sensing [22].

3. Multi-dimensional histograms: Our equi-width and sparse vec- 
tor recovery algorithms admit straightforward generalizations to 
multi-dimensional settings. We also show that the error bounds for 
equi-width histograms extend to multiple dimensions. Also, SpHist 
admits a powerful class of histograms characterized by sparsity under
Haar wavelet transformations. Using results in [24], we show
that not only does this class of histograms have small memory foot- 
print, but it can also estimate cardinality for high dimensional range 
queries as efficiently as existing self-tuning histograms. 

4. Dynamic QFRs and Database Updates: We present online vari- 
ants of our algorithms that maintain a histogram in the presence 
of new QFRs. We also present extensions to incorporate database 
updates that ensure that our learned histograms remain accurate. 

Finally, we include extensive empirical evaluation of our pro- 
posed techniques over real and standard synthetic datasets, includ- 
ing comparisons with prior work such as ISOMER. Our empiri- 
cal results demonstrate significant improvement over ISOMER in 
terms of accuracy in query cardinality estimation for several sce- 
narios and databases. 

Outline: We present notation and preliminaries in Section 2.
We then present our equi-width and sparse-recovery based approaches
in Section 3. We provide an empirical evaluation of our
methods in Section 4. In Section 5 we discuss related work
and contrast it with our methods, and we conclude in Section 6.

2 Notation and Preliminaries 

In this section we introduce notation and review concepts from 
learning used in the paper. 

For 1-dimensional histograms, $R$ and $A$ denote the relation and
the column, respectively, over which a histogram is defined. We
assume throughout that the domain of column $A$ is $[1, \ldots, r]$; our
algorithms can be generalized to handle categorical and other numeric
domains. Also, let $M$ be the number of records in $R$, i.e.,
$M = |R|$. A histogram over $A$ consists of $k$ buckets $B_1, \ldots, B_k$.
Each bucket $B_j$ is associated with an interval $[\ell_j, u_j]$ and a count
$n_j$ representing (an estimate of) the number of values in $R(A)$ that
belong to interval $[\ell_j, u_j]$. The intervals $[\ell_j, u_j]$ $(1 \le j \le k)$ are
non-overlapping and partition the domain $[1, r]$. We say that the
width of bucket $B_j$ is $(u_j - \ell_j + 1)$. A histogram represents an
approximate distribution of values in $R(A)$; the estimated frequency
of value $i \in [\ell_j, u_j]$ is $n_j / (u_j - \ell_j + 1)$. We use interval $[\ell, u]$ to represent
the range query $\sigma_{A \in [\ell, u]}(R)$.

For conciseness, we use a vector notation to represent queries
and histograms. We denote vectors by lower-case bold letters (e.g.,
$\mathbf{w}$) and matrices by upper-case letters (e.g., $A$). The term $w_i$ denotes
the $i$-th component of $\mathbf{w}$, and $\mathbf{a}^T\mathbf{b}$ (or $\mathbf{a} \cdot \mathbf{b}$) $= \sum_i a_i b_i$
denotes the inner product between vectors $\mathbf{a}$ and $\mathbf{b}$. We represent a
histogram as a vector $\mathbf{h} \in \mathbb{R}^r$ specifying its estimated distribution,
i.e., $h_i = n_j / (u_j - \ell_j + 1)$ for $i \in [\ell_j, u_j]$. By definition $\mathbf{h}$ is constant in each of the $k$ bucket
intervals and we refer to such vectors as $k$-piecewise constant. We
represent a range query $q = [\ell, u]$ in unary form as $\mathbf{q} \in \mathbb{R}^r$ where
$q_i = 1, \forall i \in [\ell, u]$ and $q_i = 0$ otherwise. Hence, the estimated
cardinality of $q$ using $\mathbf{h}$, denoted $\hat{s}_q$, is given by $\hat{s}_q = \mathbf{q}^T\mathbf{h}$.
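The vector notation above can be made concrete with a small NumPy sketch. It expands a list of buckets into a $k$-piecewise constant frequency vector and evaluates one range query as an inner product. The helper names (`histogram_vector`, `unary_query`) and the toy bucket counts are illustrative assumptions, not part of the paper.

```python
import numpy as np

def histogram_vector(buckets, r):
    """Expand (lo, hi, count) buckets over the domain [1, r] into a length-r frequency vector."""
    h = np.zeros(r)
    for lo, hi, count in buckets:
        # The bucket count is spread uniformly over the bucket's (hi - lo + 1) values.
        h[lo - 1:hi] = count / (hi - lo + 1)
    return h

def unary_query(lo, hi, r):
    """Unary form of the range query [lo, hi]: 1 inside the range, 0 elsewhere."""
    q = np.zeros(r)
    q[lo - 1:hi] = 1.0
    return q

r = 8
buckets = [(1, 4, 40), (5, 6, 10), (7, 8, 30)]   # hypothetical 3-bucket histogram
h = histogram_vector(buckets, r)
q = unary_query(3, 6, r)
estimate = float(q @ h)                           # estimated cardinality q^T h
print(h, estimate)
```

Here the query [3, 6] overlaps two values of the first bucket (frequency 10 each) and both values of the second bucket (frequency 5 each), so the inner product adds those four frequencies.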

When discussing multi-dimensional histograms, we use $A_1, \ldots, A_d$
to denote the $d$ columns over which a histogram is defined. For
ease of exposition, we assume all column domains are $[1, r]$. We
first present notation for $d = 2$: A histogram is an (estimated) value
distribution over every possible assignment of values to columns
$A_1$ and $A_2$, and can be represented as a matrix $H \in \mathbb{R}^{r \times r}$. A
$k$-bucket histogram has $k$ non-overlapping rectangles with uniform
estimated frequency within each rectangle; we also consider other
kinds of "sparse" histograms that we define in Section 3.4. A query
$Q$ is of the form $\sigma_{A_1 \in [\ell_1, u_1] \wedge A_2 \in [\ell_2, u_2]}(R)$ and is represented
in unary form as a matrix $Q \in \mathbb{R}^{r \times r}$. The estimated cardinality of
query $Q$ using histogram $H$ is given by their inner product $\hat{s}_Q =
\langle Q, H \rangle = \mathrm{Tr}(Q^T H)$. For $d > 2$, histograms and queries are $d$-dimensional
tensors ($\in \mathbb{R}^{r \times \cdots \times r}$); the estimated cardinality of query
$Q \in \mathbb{R}^{r \times \cdots \times r}$ using histogram $H \in \mathbb{R}^{r \times \cdots \times r}$ is given by the tensor inner
product $\hat{s}_Q = \langle Q, H \rangle$.
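As a quick check of the multi-dimensional notation, the following sketch evaluates a 2-d range query both as $\mathrm{Tr}(Q^T H)$ and as an element-wise sum, and uses `np.tensordot` for a 3-d case. The toy histograms and the helper `unary_query_2d` are assumptions made for illustration only.

```python
import numpy as np

def unary_query_2d(l1, u1, l2, u2, r):
    """Unary matrix of the query A1 in [l1, u1] AND A2 in [l2, u2]."""
    Q = np.zeros((r, r))
    Q[l1 - 1:u1, l2 - 1:u2] = 1.0
    return Q

r = 4
H = np.zeros((r, r))
H[0:2, 0:2] = 5.0   # hypothetical bucket: A1 in [1,2], A2 in [1,2], 20 records
H[2:4, 0:4] = 1.0   # hypothetical bucket: A1 in [3,4], A2 in [1,4], 8 records

Q = unary_query_2d(2, 3, 1, 2, r)
est_trace = float(np.trace(Q.T @ H))   # <Q, H> written as Tr(Q^T H)
est_sum = float(np.sum(Q * H))         # the same inner product, element-wise

# For d > 2 the tensor inner product contracts over all d axes.
H3 = np.arange(8.0).reshape(2, 2, 2)
Q3 = np.zeros((2, 2, 2))
Q3[0, :, :] = 1.0
est_3d = float(np.tensordot(Q3, H3, axes=3))
print(est_trace, est_sum, est_3d)
```

The element-wise form avoids building the $r \times r$ product $Q^T H$ and generalizes directly to tensors, which is why it is usually the more practical way to compute the estimate.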

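Lp-norm: $\|\mathbf{x}\|_p$ denotes the $L_p$ norm of $\mathbf{x} \in \mathbb{R}^r$ and is given by
$\|\mathbf{x}\|_p = \left(\sum_{i=1}^{r} |x_i|^p\right)^{1/p}$.

Lipschitz Functions: A function $f : \mathbb{R}^r \to \mathbb{R}$ is $L$-Lipschitz continuous
if: $\forall \mathbf{x}, \mathbf{y}$, $|f(\mathbf{x}) - f(\mathbf{y})| \le L \cdot \|\mathbf{x} - \mathbf{y}\|_2$.

Convex Functions: A function $f : \mathbb{R}^r \to \mathbb{R}$ is convex if:
$\forall\, 0 \le \lambda \le 1$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^r$, $f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y}) \le \lambda f(\mathbf{x}) + (1-\lambda) f(\mathbf{y})$.
Furthermore, $f : \mathbb{R}^r \to \mathbb{R}$ is $\alpha$-strongly convex ($\alpha > 0$) w.r.t. the $L_2$
norm if, $\forall\, 0 \le \lambda \le 1$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^r$:

$$f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y}) \le \lambda f(\mathbf{x}) + (1-\lambda) f(\mathbf{y}) - \frac{\alpha\lambda(1-\lambda)}{2}\|\mathbf{x} - \mathbf{y}\|_2^2.$$

Let $H \in \mathbb{R}^{r \times r}$ be the Hessian of $f$; then $f$ is $\alpha$-strongly convex iff
the smallest eigenvalue of $H$ is at least $\alpha$.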

Empirical-risk minimization: In many learning applications, the
input is a set of training examples $X = \{\mathbf{x}_i, 1 \le i \le n\}$ and their
labels/predictions $Y = \{y_i, 1 \le i \le n\}$. The goal is to minimize
the expected error on unseen test points after training on a small
training set $X$. Empirical-risk minimization (ERM) is a canonical
algorithm that provably achieves this goal in the setting given below.

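Let each training sample $\mathbf{x}_i$ be sampled i.i.d. from a fixed distribution
$\mathcal{D}$, i.e., $\mathbf{x}_i \sim \mathcal{D}$. Let $\ell(\mathbf{w}; \mathbf{x}) : \mathbb{R}^r \times \mathbb{R}^r \to \mathbb{R}$ be the loss
function that gives the loss incurred by model parameters $\mathbf{w}$ for a
given point $\mathbf{x}$. Then, the goal is to minimize the expected loss, i.e.,

$$\min_{\mathbf{w}} F(\mathbf{w}) = E_{\mathbf{x} \sim \mathcal{D}}[\ell(\mathbf{w}; \mathbf{x})]. \qquad (1)$$

Let $\mathbf{w}^*$ be the optimal solution to (1).

Typically, the distribution $\mathcal{D}$ is unknown. To address this, the empirical
risk minimization (ERM) approach uses the empirical distribution
derived from the training data in lieu of $\mathcal{D}$. Formally, ERM
solves for:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \hat{F}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \ell(\mathbf{w}; \mathbf{x}_i). \qquad (2)$$

If the loss function $\ell$ satisfies certain properties, then we can prove
bounds relating the quality (objective function value $F(\cdot)$) of $\hat{\mathbf{w}}$
and $\mathbf{w}^*$. In particular, [20] provided a bound on $F(\hat{\mathbf{w}})$ for the case
of strongly convex loss functions: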

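Theorem 1 (Stochastic Convex Optimization [20]).
Let $\ell(\mathbf{w}; \mathbf{x}) = f(\mathbf{w}^T\mathbf{x}; \mathbf{x}) + h(\mathbf{w})$, where $h : \mathbb{R}^r \to \mathbb{R}$ is an $\alpha$-strongly
convex regularization function. Let $f(u; \mathbf{x})$ be a convex
$L_f$-Lipschitz continuous function in $u$ and let $\|\mathbf{x}\|_2 \le R$. Let $\mathbf{w}^*$
be the optimal solution of Problem (1) and $\hat{\mathbf{w}}$ be the optimal solution
of Problem (2). Then, for any distribution over $\mathbf{x}$ and any
$\delta > 0$, with probability at least $1 - \delta$ over a sample $X$ of size $n$:

$$F(\hat{\mathbf{w}}) - F(\mathbf{w}^*) \le O\left(\frac{R^2 L_f^2 \log(1/\delta)}{\alpha n}\right). \qquad (3)$$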

Hence, the above theorem shows that solving ERM (i.e., Problem (2))
serves as a good "proxy" for solving the original problem
(1), and that the additional expected error decreases in inverse
proportion to the number of training samples $n$.

Haar wavelets: Wavelets serve as a popular tool for compressing a
regular signal (a vector in finite dimensions for our purposes) [14].
In particular, Haar wavelets can compress a piecewise-constant
vector effectively and can therefore be used for a parsimonious representation
of $k$-bucket histograms.

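The Haar wavelet transform is a linear orthogonal transformation of a
vector that yields its wavelet coefficients. In particular, given a vector
$\mathbf{x} \in \mathbb{R}^r$, we obtain a vector $\boldsymbol{\alpha} \in \mathbb{R}^r$ of wavelet coefficients using:

$$\boldsymbol{\alpha} = \Psi\mathbf{x}, \qquad (4)$$

where $\Psi \in \mathbb{R}^{r \times r}$ is the wavelet transform matrix. For $r$ a power of 2,
writing each row index $i > 1$ as $i = 2^p + t$ with $p = \lceil \log_2 i \rceil - 1$ and
$1 \le t \le 2^p$, it is given by:

$$\Psi_{ij} = \begin{cases} \frac{1}{\sqrt{r}} & \text{if } i = 1,\ 1 \le j \le r, \\ \sqrt{\frac{2^p}{r}} & \text{if } i > 1,\ \frac{r}{2^p}(t - 1) < j \le \frac{r}{2^p}\left(t - \frac{1}{2}\right), \\ -\sqrt{\frac{2^p}{r}} & \text{if } i > 1,\ \frac{r}{2^p}\left(t - \frac{1}{2}\right) < j \le \frac{r}{2^p}\,t, \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

We can show that if $\mathbf{x}$ is $k$-piecewise constant in Equation (4), $\boldsymbol{\alpha}$ has
at most $k \log r$ non-zero coefficients.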
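The sparsity claim can be checked numerically. The sketch below builds a standard orthonormal Haar matrix for $r$ a power of two (the row ordering is a common convention and is an assumption here, it may differ from the paper's indexing), transforms a 3-piecewise constant vector, and counts the non-zero coefficients. The function name `haar_matrix` and the test vector are illustrative.

```python
import numpy as np

def haar_matrix(r):
    """Orthonormal Haar transform matrix; each row is one basis vector. Assumes r = 2^m."""
    if r < 1 or r & (r - 1):
        raise ValueError("r must be a power of two")
    Psi = np.zeros((r, r))
    Psi[0, :] = 1.0 / np.sqrt(r)          # scaling (average) row
    row, p = 1, 0
    while 2 ** p < r:
        width = r // 2 ** p               # support length of wavelets at level p
        amp = np.sqrt(2 ** p / r)         # normalization giving unit-norm rows
        for t in range(2 ** p):
            start = t * width
            Psi[row, start:start + width // 2] = amp
            Psi[row, start + width // 2:start + width] = -amp
            row += 1
        p += 1
    return Psi

r, k = 64, 3
Psi = haar_matrix(r)
# A 3-piecewise constant vector with breakpoints after positions 20 and 50.
x = np.concatenate([np.full(20, 4.0), np.full(30, 1.0), np.full(14, 7.0)])
alpha = Psi @ x
nnz = int(np.sum(np.abs(alpha) > 1e-9))
print(nnz, k * np.log2(r))
```

Only wavelets whose support straddles a breakpoint can be non-zero, and at most one wavelet per level straddles any given breakpoint, which is where the $\log r$ factor comes from.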

For signals in higher dimensions, the wavelet transform can be obtained
by first vectorizing the signal and then applying the transformation
of Equation (4) (see [14] for more details). Several existing
studies show that high-dimensional real-life data mostly resides in
a small number of clusters and hence most of the wavelet coefficients
are nearly zero [7], i.e., the vector of coefficients is sparse.

3 Method 

In this section, we present our learning theoretic framework and 
algorithms. We first review the architectural context for self-tuning 
histogram learning. 

Architecture: We assume the architectural context of prior work
in self-tuning histograms [21]. In particular, we assume operators
in query plans are instrumented to maintain counters and produce
query feedback records (QFRs) at the end of query execution.
Recall that a QFR is a filter sub-expression and its cardinality.
(In the following, we abuse notation and refer to such sub-expressions
as queries although they are actually parts of a query.)
These QFRs are available as input to our learning system either 
continuously or in a batched fashion (e.g., during periods of low 
system load). Although we present our learning framework assum- 
ing QFRs are the only input, we can extend our framework to in- 
corporate workload-independent data characteristics, e.g., an initial 
histogram constructed from data. 

First, in Section 3.1 we present our learning theoretic framework
for self-tuning histograms. In Section 3.2 we study equi-width histograms
in our framework and present a learning algorithm for the
same (EquiHist). We also present formal error analysis for histograms
learned by EquiHist. Equi-width histograms are known
to be unsuitable in many settings, such as sparse high-dimensional
datasets. To handle this, in Section 3.3 we present an algorithm
(SpHist) for learning general histograms that relies on a reduction
from histogram learning to sparse vector recovery. For presentational
simplicity, Sections 3.1-3.3 assume static QFRs, no database
updates, and 1-dimensional histograms. We extend our algorithms
to multi-dimensional histograms in Section 3.4 and to dynamic data
and QFRs in Section 3.5.

3.1 Problem Formulation 

We now formalize the histogram estimation problem. Our formulation
is based on standard learning assumptions: we assume
a training query workload of QFRs sampled from a fixed
distribution, and the goal is to estimate a $k$-bucket histogram that
incurs small expected error for unseen queries from the same distribution.
We consider histograms over column $A$ of relation $R$.
Recall from Section 2 that the domain of $A$ is $[1, r]$ and a $k$-bucket
histogram $\mathbf{h} \in \mathbb{R}^r$ is a $k$-piecewise constant vector.

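Let $\mathcal{D}$ be a fixed (unknown) distribution of range queries over
$R(A)$. Let $Q = \{(\mathbf{q}_1, s_{q_1}), \ldots, (\mathbf{q}_N, s_{q_N})\}$ be a query workload
used for training, where each $\mathbf{q}_i \sim \mathcal{D}$, $\forall\, 1 \le i \le N$, and $s_q$ is the
cardinality of query $q$ when evaluated over $R$.

Let $f(\hat{s}_q; s_q) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a loss function that measures the
error between the estimated cardinality, $\hat{s}_q$, and the actual cardinality,
$s_q$, of query $q$. Since the estimated cardinality of $q$ using histogram
$\mathbf{h}$ is $\hat{s}_q = \mathbf{q}^T\mathbf{h}$, the error incurred by $\mathbf{h}$ on $q$ is $f(\mathbf{q}^T\mathbf{h}; s_q)$. Example
loss functions include the $L_1$ loss ($f(\mathbf{q}^T\mathbf{h}; s_q) = |\mathbf{q}^T\mathbf{h} - s_q|$)
and the $L_2$ loss ($f(\mathbf{q}^T\mathbf{h}; s_q) = (\mathbf{q}^T\mathbf{h} - s_q)^2$).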
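The two example losses, and the average loss of a histogram over a workload, can be sketched as follows. For convenience this sketch stores one unary query per row (the paper's algorithms store queries as columns); the function names and the toy workload are assumptions for illustration.

```python
import numpy as np

def l1_loss(est, actual):
    """L1 loss |q^T h - s_q|, applied element-wise over a workload."""
    return np.abs(est - actual)

def l2_loss(est, actual):
    """L2 loss (q^T h - s_q)^2, applied element-wise over a workload."""
    return (est - actual) ** 2

def empirical_error(h, queries, cards, loss):
    """Average loss of histogram h; queries is N x r with one unary query per row."""
    return float(np.mean(loss(queries @ h, cards)))

h = np.array([10.0, 10.0, 5.0, 5.0])          # hypothetical histogram over [1, 4]
queries = np.array([[1.0, 1.0, 0.0, 0.0],     # range [1, 2]
                    [0.0, 1.0, 1.0, 1.0]])    # range [2, 4]
cards = np.array([18.0, 20.0])                # hypothetical true cardinalities

err_l1 = empirical_error(h, queries, cards, l1_loss)
err_l2 = empirical_error(h, queries, cards, l2_loss)
print(err_l1, err_l2)
```

Both queries are estimated as 20, so the first misses by 2 and the second is exact.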

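Our goal is to learn a $k$-bucket histogram $\mathbf{h} \in \mathbb{R}^r$ that minimizes
the expected error $F_{\mathcal{D}}(\mathbf{h}) = E_{\mathbf{q} \sim \mathcal{D}}[f(\mathbf{q}^T\mathbf{h}; s_q)]$ incurred by $\mathbf{h}$ on
test queries sampled from $\mathcal{D}$. Formally, we define our histogram
estimation problem as:

$$\min_{\mathbf{h}} F_{\mathcal{D}}(\mathbf{h}), \quad \text{s.t. } \mathbf{h} \in \mathcal{C}, \qquad (6)$$

and let $\mathbf{h}^*$ be the optimal solution to the above problem, i.e.,

$$\mathbf{h}^* = \arg\min_{\mathbf{h} \in \mathcal{C}} F_{\mathcal{D}}(\mathbf{h}), \qquad (7)$$

where $\mathcal{C}$ represents the following set of histograms:

$$\mathcal{C} = \{\mathbf{h} : \mathbf{h} \in \mathbb{R}^r \text{ is a histogram over range } [1, r] \text{ with at most } k \text{ buckets and minimum bucket width } \Delta\}. \qquad (8)$$

Note that $\mathcal{C}$ only contains histograms in which every bucket has width
at least $\Delta$. Parameter $\Delta$ can be arbitrary; we introduce it for the
purpose of analysis only. While our analysis does not make any assumption
on $\Delta$, the bounds are naturally better if $\Delta$ of the optimal
histogram is large, i.e., the optimal histogram is relatively flat.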
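Membership in the class $\mathcal{C}$ of Equation (8) can be tested directly on a frequency vector. The sketch below is one possible check, assuming buckets are taken to be the maximal constant runs of the vector (merging adjacent equal-height buckets only lowers the bucket count and raises the widths, so this is the most favorable partition). The names `bucket_widths` and `in_class_C` are hypothetical.

```python
import numpy as np

def bucket_widths(h):
    """Widths of the maximal constant runs of h, i.e. its coarsest bucket partition."""
    h = np.asarray(h, dtype=float)
    change = np.flatnonzero(np.diff(h) != 0) + 1      # indices where a new run starts
    edges = np.concatenate(([0], change, [len(h)]))
    return np.diff(edges)

def in_class_C(h, k, min_width):
    """True if h has at most k buckets and every bucket has width >= min_width."""
    widths = bucket_widths(h)
    return bool(len(widths) <= k and widths.min() >= min_width)

h = [3, 3, 3, 1, 1, 7, 7, 7]
print(bucket_widths(h), in_class_C(h, k=3, min_width=2))
```

The combinatorial nature of this check (bucket boundaries are free variables) is what makes the constraint set hard to optimize over directly.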

We next study equi-width histograms in our framework and pro- 
vide an efficient algorithm for the same. 

3.2 Equi-width Approach 

In this section, we study equi-width histograms for solving Problem
(6) and also provide approximation guarantees for the resulting
method.

Observe that the set $\mathcal{C}$ (Equation (8)) is non-convex; hence we
cannot apply standard convex optimization techniques to obtain the
optimal solution to (6). To handle this, we relax the problem by
fixing the bucket boundaries to be equi-spaced. That is, we first
consider the class of histograms with $b$ equal-width buckets:

$$\mathcal{C}' = \{\mathbf{h} : \mathbf{h} \in \mathbb{R}^r \text{ is a histogram over integer range } [1, r] \text{ with } b \text{ equal-width buckets}\}. \qquad (9)$$

For ease of exposition, we assume $r$ is divisible by $b$. Note that,
for any $\mathbf{h} \in \mathcal{C}'$, we can find $\mathbf{w} \in \mathbb{R}^b$ such that $\mathbf{h} = B\mathbf{w}$, where
$B \in \mathbb{R}^{r \times b}$ and

$$B_{ij} = \begin{cases} 1 & \text{if } \frac{r}{b}(j - 1) < i \le \frac{r}{b}\,j, \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$

For illustration (with $r/b = 2$),

$$\mathbf{h} = B\mathbf{w} = \begin{bmatrix} 1 & 0 & \cdots \\ 1 & 0 & \cdots \\ 0 & 1 & \cdots \\ 0 & 1 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \end{bmatrix} = \begin{bmatrix} w_1 \\ w_1 \\ w_2 \\ w_2 \\ \vdots \end{bmatrix}. \qquad (11)$$

Therefore, searching for a histogram $\mathbf{h}$ over $\mathcal{C}'$ is equivalent to
searching for a $\mathbf{w} \in \mathbb{R}^b$. Furthermore, the optimal $\mathbf{h}^*$ to (6) should
satisfy $\|\mathbf{h}^*\|_1 = \sum_i h^*_i = M$, i.e., the total number of database
records. Also, $\mathbf{h}^*$ should have minimum bucket width $\Delta$, hence
$\|\mathbf{h}^*\|_\infty \le M/\Delta$. Using these observations, we can constrain $\mathbf{w}$ to
belong to a convex set $\mathcal{K}$:

$$\mathcal{K} = \left\{\mathbf{w} \in \mathbb{R}^b : \|\mathbf{w}\|_1 \le \frac{Mb}{r} \text{ and } \|\mathbf{w}\|_\infty \le \frac{M}{\Delta}\right\}. \qquad (12)$$

Thus, $\mathcal{C}'$ can be redefined as:

$$\mathcal{C}' = \{\mathbf{h} : \mathbf{h} = B\mathbf{w},\ \mathbf{w} \in \mathcal{K}\}, \qquad (13)$$

and searching for a histogram $\mathbf{h}$ over $\mathcal{C}'$ induces a corresponding
search for $\mathbf{w}$ over $\mathcal{K}$, which is a convex set.

Using the above observations, we obtain the following relaxed
problem:

$$\min_{\mathbf{w} \in \mathcal{K}} F_{\mathcal{D}}(B\mathbf{w}). \qquad (14)$$

To avoid overfitting, we add entropy regularization to the above
objective function. This leads to the following relaxed problem:

$$\min_{\mathbf{w} \in \mathcal{K}} G_{\mathcal{D}}(\mathbf{w}) = E_{\mathbf{q} \sim \mathcal{D}}\left[f(\mathbf{q}^T B\mathbf{w}; s_q)\right] - \lambda H\!\left(\frac{r}{Mb}\mathbf{w}\right), \qquad (15)$$

where $M$ is the size of the relation $R$, i.e., $M = |R|$, $\lambda > 0$ is a
constant specified later, and $H\!\left(\frac{r}{Mb}\mathbf{w}\right) = -\sum_{j=1}^{b} \frac{r w_j}{Mb} \log \frac{r w_j}{Mb}$
is the entropy of $\frac{r}{Mb}\mathbf{w}$. Note that we normalize each $w_j$ by multiplying
it by $r/(Mb)$ to make $\mathbf{w}$ a probability distribution.

The distribution $\mathcal{D}$ is unknown except through the example queries
in $Q$, and so we cannot directly solve (15). As mentioned in Section 2,
we instead optimize an empirical estimate ($\hat{G}(\cdot)$) of the objective
$G_{\mathcal{D}}(\cdot)$, which finally leads us to the following problem:

$$\min_{\mathbf{w} \in \mathcal{K}} \hat{G}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} f(\mathbf{q}_i^T B\mathbf{w}; s_{q_i}) - \lambda H\!\left(\frac{r}{Mb}\mathbf{w}\right), \qquad (16)$$

where $\mathcal{K}$ is given by (12). Let $\hat{\mathbf{w}}$ be the optimal solution to (16), i.e.,

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathcal{K}} \hat{G}(\mathbf{w}), \qquad \hat{\mathbf{h}} = B\hat{\mathbf{w}}.$$

Now note that the above relaxed problem is a convex program and
can be solved optimally and efficiently using standard convex optimization
methods. However, the obtained solution need not be
optimal for our original problem (6).

Interestingly, in the following theorem, we show that the optimal
equi-width histogram $\hat{\mathbf{h}} \in \mathcal{C}'$ is a provably approximate solution to
the original problem (6). In particular, the theorem shows that by
training with a finite number of queries ($N$) and selecting the number
of buckets $b$ to be a multiplicative factor larger than the required
number of buckets $k$, the objective function (in Problem (6)) at $\hat{\mathbf{h}}$ is
at most $\epsilon$ larger than the optimal value.

Theorem 2. Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a convex $L_f$-Lipschitz
continuous loss function. Let $\hat{\mathbf{w}}$ be the optimal solution to (16),
$\hat{\mathbf{h}} = B\hat{\mathbf{w}}$, and let each query $\mathbf{q}_i \sim \mathcal{D}$. Let $\mathbf{h}^*$ be the optimal solution
to (6), with minimum bucket width $\Delta$. Let $|Q| = \max_{\mathbf{q} \sim \mathcal{D}} \|\mathbf{q}\|_1$,
i.e., the largest range of any query, and let $C_1, C_2 > 0$ be universal
constants. Then, if the number of training queries ($N$) satisfies

$$N \ge C_1\,[\ldots],$$

and if the number of buckets ($b$) in $\hat{\mathbf{h}}$ satisfies $b \ge C_2 k\,[\ldots]$, we
have

$$F_{\mathcal{D}}(\hat{\mathbf{h}}) = E_{\mathcal{D}}[f(\mathbf{q}^T\hat{\mathbf{h}}; s_q)] \le F(\mathbf{h}^*) + M\epsilon.$$

See Appendix A.1 of our full version [23] for a detailed proof.

Note that the above theorem shows that the histogram $\hat{\mathbf{h}}$ that we
learn satisfies both required properties:

• The number of buckets ($b$) in $\hat{\mathbf{h}}$ is given by $b = C_2 k\,[\ldots]$. Hence, the
number of buckets in $\hat{\mathbf{h}}$ is larger than $k$ by a small approximation
factor. In fact, if queries are generally "short", i.e., $|Q|$ is smaller
than $\Delta$, then the approximation factor is a constant dependent only
on the accuracy parameter $\epsilon$.

• The relative expected error incurred by $\hat{\mathbf{h}}$ is only $\epsilon$, while sampling
$N = [\ldots]$ queries. Note that our bound on $N$ is
independent of $r$; hence the number of queries does not depend on
the range of the space, but only on the "complexity" of the histograms
and queries considered, i.e., on $\Delta$ and $|Q|$. Our bound confirms
the intuition that if $\Delta$ is smaller, that is, the optimal histogram
has more buckets and is more "spiky", then the number of queries
needed is also very large. However, if $\Delta$ is a constant factor of the
range $r$, then the number of queries required is a constant.

Now, the above bounds depend critically upon the loss function
$f$ through its Lipschitz constant $L_f$. In the following corollary, we
provide bounds for the loss function $f(\mathbf{q}^T\mathbf{h}; s_q) = |\mathbf{q}^T\mathbf{h} - s_q|$.

Corollary 3. Let $f(\mathbf{q}^T\mathbf{h}; s_q) = |\mathbf{q}^T\mathbf{h} - s_q|$. Then, under the
assumptions of Theorem 2, and by selecting $N$ and $b$ to satisfy

$$N \ge C_1\,[\ldots], \qquad b \ge C_2 k\,[\ldots],$$

we have

$$E_{\mathcal{D}}\left[|\mathbf{q}^T\hat{\mathbf{h}} - s_q|\right] \le F(\mathbf{h}^*) + M\epsilon.$$

Note that $f(\mathbf{q}^T\mathbf{h}; s_q) = |\mathbf{q}^T\mathbf{h} - s_q|$ is a 1-Lipschitz convex function.
The above corollary now follows directly from Theorem 2.

EquiHist Method: While selecting the loss function to be the $L_1$ loss
($f(\mathbf{q}^T\mathbf{h}; s_q) = |\mathbf{q}^T\mathbf{h} - s_q|$) provides tight bounds, in practice
optimization with the $L_1$ loss is expensive as it is not a smooth, differentiable
function. Instead, for our implementation, we use the $L_2$ loss
and select regularization parameter $\lambda = 0$. Hence, the empirical
risk minimization problem that our equi-width histogram method
(EquiHist) solves is given by:

$$\min_{\mathbf{w} \in \mathbb{R}^b} \hat{G}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{q}_i^T B\mathbf{w} - s_{q_i})^2. \qquad (17)$$

Using techniques similar to Theorem 2, we can easily obtain approximation
guarantees for the optimum of the above problem. Also,
the above optimization problem is the well-known least-squares
problem and its solution can be obtained in closed form. Algorithm 1
provides pseudo-code for our method (EquiHist). Here, $Q$
is a matrix whose $i$-th column is the training query $\mathbf{q}_i$, and $\mathbf{s}$ is
the column vector containing the corresponding query cardinalities.
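The least-squares fit described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses `np.linalg.lstsq` instead of forming the explicit inverse (the two agree when $B^T Q Q^T B$ is invertible), and the synthetic workload, the random seed, and the function names `bucket_matrix` and `equihist` are all hypothetical.

```python
import numpy as np

def bucket_matrix(r, b):
    """B in R^{r x b} from Equation (10): B[i, j] = 1 iff value i lies in bucket j."""
    if r % b:
        raise ValueError("r must be divisible by b")
    return np.kron(np.eye(b), np.ones((r // b, 1)))

def equihist(Q, s, b):
    """Least-squares fit of Equation (17). Q is r x N with one unary query per column."""
    B = bucket_matrix(Q.shape[0], b)
    A = Q.T @ B                                   # N x b: cells of each bucket covered per query
    w, *_ = np.linalg.lstsq(A, s, rcond=None)     # minimizes ||A w - s||_2
    return B @ w

# Synthetic check: data that truly follows a 4-bucket equi-width histogram.
rng = np.random.default_rng(0)
r, b, N = 16, 4, 200
true_h = np.repeat([8.0, 2.0, 5.0, 1.0], r // b)
lo = rng.integers(0, r, size=N)
hi = rng.integers(0, r, size=N)
lo, hi = np.minimum(lo, hi), np.maximum(lo, hi)
Q = np.zeros((r, N))
for i in range(N):
    Q[lo[i]:hi[i] + 1, i] = 1.0                   # unary form of the range [lo, hi]
s = Q.T @ true_h                                  # exact query feedback
h_hat = equihist(Q, s, b)
print(np.round(h_hat, 3))
```

Because the feedback is exact and the true distribution lies in the equi-width class, the fit recovers it; with inconsistent feedback the same call returns the least-squares compromise, which is why no QFRs need to be discarded.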



Algorithm 1 EquiHist: Equi-width histogram based method for
Histogram Estimation (1-dimensional case)

1: Input: Training queries $Q \in \mathbb{R}^{r \times N}$, where the $i$-th column $Q_i \in \mathbb{R}^r$ is the $i$-th query $\mathbf{q}_i$; $\mathbf{s} = [s_{q_1}; s_{q_2}; \ldots; s_{q_N}] \in \mathbb{R}^N$, the column vector of training query cardinalities
2: Parameters: $k$: number of histogram buckets
3: $B \in \mathbb{R}^{r \times k}$ is as given in (10) (with $b = k$)
4: $\hat{\mathbf{w}} \leftarrow (B^T Q Q^T B)^{-1} B^T Q\,\mathbf{s}$ (solution to (17))
5: $\hat{\mathbf{h}} = B\hat{\mathbf{w}}$
6: Output: $\hat{\mathbf{h}}$

3.3 Sparse-vector Recovery based Approach

In the previous subsection, we provided an approximation algorithm
for Problem (6) by fixing bucket boundaries to be equi-width.
However, when the number of buckets required is extremely small,
selecting large equi-width buckets might incur heavy error in
practice. Furthermore, in high dimensions the histograms can be
very "spiky"; hence the minimum bucket width $\Delta$ might be small, leading
to poor accuracy both theoretically and in practice.

To alleviate the above problem, we formulate a sparse-vector
recovery based method that can use recently developed
methods from the sparse-vector recovery domain. For this purpose, we
use the $L_2$ loss for our objective function:

$$\hat{F}(\mathbf{h}) = \frac{1}{N}\sum_{i=1}^{N} (s_{q_i} - \mathbf{q}_i^T\mathbf{h})^2. \qquad (18)$$

Now, we use the wavelet basis to transform $\mathbf{h}$ into its wavelet coefficients.
Let $\Psi$ be the Haar wavelet basis, and $\boldsymbol{\alpha} = \Psi\mathbf{h}$ be the
wavelet transform of $\mathbf{h}$. Since $\Psi$ is orthonormal, we can rewrite
cardinality estimation using $\mathbf{h}$ as:

$$\hat{s}_q = \mathbf{q}^T\mathbf{h} = \mathbf{q}^T\Psi^T\Psi\mathbf{h} = \mathbf{q}^T\Psi^T\boldsymbol{\alpha}, \qquad (19)$$

where $\boldsymbol{\alpha}$ is the vector of Haar wavelet coefficients of $\mathbf{h}$. Furthermore,
using standard results in wavelet transforms [14], if $\mathbf{h}$ is $k$-piecewise
constant then the wavelet transform has at most $k \log r$
non-zero coefficients. As $k$ is significantly smaller than $r$,
the wavelet transform of $\mathbf{h}$ should be sparse and we can use sparse-vector
recovery techniques from the compressed sensing community
to recover these wavelet coefficients.

We now describe our sparse-vector recovery based approach to
estimate histograms. Below, we formally specify our sparse wavelet
coefficient recovery problem:

$$\boldsymbol{\alpha}^* = \arg\min_{\mathrm{supp}(\boldsymbol{\alpha}) \le k} \frac{1}{N}\|Q^T\Psi^T\boldsymbol{\alpha} - \mathbf{s}\|_2^2, \qquad (20)$$

where $\mathrm{supp}(\boldsymbol{\alpha})$ is the number of non-zeros in $\boldsymbol{\alpha}$.

Note that the above problem is in general NP-hard. However,
several recent works in the area of compressed sensing [5, 4] show
that under certain settings $\boldsymbol{\alpha}^*$ can be obtained up to an approximation
factor. Unfortunately, random range queries do not satisfy the
necessary conditions for sparse-vector recovery, and hence formal
guarantees for this approach do not follow directly from existing
proof techniques. We leave a proof for our approach as future work.

Instead, we use sparse-recovery algorithms as heuristics for our
problem. In particular, we use one of the most popular sparse-recovery
algorithms, Orthogonal Matching Pursuit (OMP) [22]. OMP
is a greedy technique that starts with an empty set of coefficients
(i.e., $\mathrm{supp}(\boldsymbol{\alpha}) = 0$). Then, at each step, OMP adds to the support set
the coefficient that leads to the largest decrease in the objective
function value. After greedily selecting $k$ coefficients, we obtain $\hat{\boldsymbol{\alpha}}$
and its support set with at most $k$ coefficients.

Algorithm 2 SpHist: Sparse-recovery based Histogram Estimation
(1-dimensional case)

1: Input: Training queries $Q \in \mathbb{R}^{r \times N}$, where the $i$-th column $Q_i \in \mathbb{R}^r$ is the $i$-th query $\mathbf{q}_i$; $\mathbf{s} = [s_{q_1}; s_{q_2}; \ldots; s_{q_N}] \in \mathbb{R}^N$, the column vector of training query cardinalities
2: Parameters: $k$: number of histogram buckets
3: Set support set $S = \emptyset$, residual $\mathbf{z}_0 = \mathbf{s}$
4: Set $A = Q^T\Psi^T$ (note that $A \in \mathbb{R}^{N \times r}$)
5: $t = 1$
6: repeat
7:   Find index $I_t = \arg\max_{1 \le j \le r} |A_j^T\mathbf{z}_{t-1}|$
8:   $S = S \cup \{I_t\}$
9:   $\boldsymbol{\alpha}^t = \mathbf{0}$ ($\boldsymbol{\alpha}^t \in \mathbb{R}^r$)
10:  Least squares solution: $\boldsymbol{\alpha}^t_S = \arg\min_{\boldsymbol{\alpha}_S \in \mathbb{R}^{|S|}} \|A_S\boldsymbol{\alpha}_S - \mathbf{s}\|_2$. // $A_S$ is the column submatrix of $A$ whose columns are listed in $S$; $\boldsymbol{\alpha}^t_S$ is the sub-vector of $\boldsymbol{\alpha}^t$ with components listed in set $S$.
11:  Update residual: $\mathbf{z}_t = \mathbf{s} - A\boldsymbol{\alpha}^t$
12:  $t = t + 1$
13: until ($t > k$)
14: $\hat{\boldsymbol{\alpha}} = \boldsymbol{\alpha}^k$
15: Form $\tilde{\mathbf{h}} = \Psi^T\hat{\boldsymbol{\alpha}}$
16: Apply the modified version of the DP method of [12] to $\tilde{\mathbf{h}}$ to obtain $\hat{\mathbf{h}}$ with $k$ buckets
17: Output: Histogram $\hat{\mathbf{h}}$ with $k$ buckets

Let $\hat{\boldsymbol{\alpha}}$ be computed using the OMP method; then we obtain our estimated
histogram $\tilde{\mathbf{h}}$ using:

$$\tilde{\mathbf{h}} = \Psi^T\hat{\boldsymbol{\alpha}}.$$

Note that if $\hat{\boldsymbol{\alpha}}$ has $k$ non-zeros then $\tilde{\mathbf{h}}$ will have at most $3k$ buckets
[16]; hence our estimated histogram has a small number of
buckets. To further decrease the number of buckets to $k$, we use the
dynamic programming based method of [12], which produces a small
number of buckets given the heights (or probability density values) for each
attribute value. The method of [12] runs in time
quadratic in the number of attribute values, i.e., the range $r$. However,
since our frequency distribution (histogram $\tilde{\mathbf{h}}$) has only $3k$ buckets,
we can modify the dynamic programming algorithm of
[12] so that it obtains the optimal solution in time $O(k^2)$. Algorithm 2
provides pseudo-code for our algorithm. $Q$ denotes the training
query matrix and $A_S \in \mathbb{R}^{N \times |S|}$ represents the sub-matrix of $A$
formed by $A$'s columns indexed by $S$.
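The greedy recovery step can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: candidate columns are scored by correlation with the residual normalized by column norm (a common OMP convention that the text does not specify), the final dynamic-programming merge to exactly $k$ buckets is omitted, $r$ is assumed to be a power of two, and the names `haar_matrix`, `omp`, and `sphist_sketch` are hypothetical.

```python
import numpy as np

def haar_matrix(r):
    """Orthonormal Haar matrix (rows are basis vectors); assumes r is a power of two."""
    Psi = np.zeros((r, r))
    Psi[0, :] = 1.0 / np.sqrt(r)
    row, width = 1, r
    while width > 1:
        amp = 1.0 / np.sqrt(width)
        for start in range(0, r, width):
            Psi[row, start:start + width // 2] = amp
            Psi[row, start + width // 2:start + width] = -amp
            row += 1
        width //= 2
    return Psi

def omp(A, s, k):
    """Orthogonal Matching Pursuit: greedily pick k columns of A to fit s."""
    r = A.shape[1]
    support, alpha = [], np.zeros(r)
    residual = np.asarray(s, dtype=float).copy()
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0                        # guard against all-zero columns
    for _ in range(k):
        scores = np.abs(A.T @ residual) / norms    # normalized correlation (assumed choice)
        scores[support] = -np.inf                  # never re-select a chosen column
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(A[:, support], s, rcond=None)
        alpha = np.zeros(r)
        alpha[support] = coef                      # refit all selected coefficients
        residual = s - A @ alpha
    return alpha, support

def sphist_sketch(Q, s, k):
    """Recover k Haar coefficients from QFRs (Q is r x N) and map back to frequencies."""
    Psi = haar_matrix(Q.shape[0])
    alpha, _ = omp(Q.T @ Psi.T, s, k)              # A = Q^T Psi^T
    return Psi.T @ alpha

# Sanity check with point queries on every value (Q = I), where recovery is exact.
h_true = np.repeat([4.0, 1.0], 4)
h_rec = sphist_sketch(np.eye(8), h_true, k=2)
print(np.round(h_rec, 3))
```

Refitting all selected coefficients by least squares after each selection is what distinguishes OMP from plain matching pursuit; it keeps the residual orthogonal to every chosen column.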

3.4 Multi-dimensional Histograms 

In the previous two subsections, we discussed our two approaches
for the 1-dimensional case. In this section, we extend both
approaches to the multi-dimensional case. In the next subsection we
discuss the generalization of EquiHist to multiple dimensions, and in
Section 3.4.2 we generalize our sparse-recovery based approach.

3.4.1 Equi-width Approach 

We first provide an extension of the equi-width approach to the 2-d 
case and then briefly discuss extensions to general multi-dimensional 
case. 

Recall that, given a set of range queries $Q = \{Q_1, \ldots, Q_N\}$
and their cardinalities $\mathbf{s} = \{s_{Q_1}, s_{Q_2}, \ldots, s_{Q_N}\}$, where $Q_i \in \mathbb{R}^{r \times r}$
and $Q_i \sim \mathcal{D}$, the goal is to learn a histogram $H \in \mathbb{R}^{r \times r}$ such that $H$
has at most $k$ buckets. For the 2-d case, we consider a bucket to be a
rectangle only. Note that $H$ can have arbitrary rectangular buckets;
hence the class of $H$ considered is more general than that of STGrid. However,
our class of $H$ is restricted compared to STHoles, which has an
extra "universal" bucket. Now, as in the 1-d case, the goal is to
minimize the expected error in cardinality estimation, i.e.,

$$\min_{H \in \mathcal{C}} E_{Q \sim \mathcal{D}}\left[f(\langle Q, H \rangle; s_Q)\right], \qquad (21)$$

where $\langle Q, H \rangle = \mathrm{Tr}(Q^T H)$ denotes the inner product between $Q$
and $H$, and $\mathcal{C}$ is given by:

$$\mathcal{C} = \{H \in \mathbb{R}^{r \times r} : H \text{ has } k \text{ rectangular buckets and minimum bucket size } \Delta \times \Delta\}. \qquad (22)$$

Similar to the 1-d case, we restrict histograms to the set $\mathcal{C}'$ that consists of
$b = b_1 \times b_1$ equi-width buckets. Now it is easy to verify that for
any $H \in \mathcal{C}'$, we can find a matrix $W \in \mathbb{R}^{b_1 \times b_1}$ s.t.

$$H = BWB^T, \qquad (23)$$

where $B \in \mathbb{R}^{r \times b_1}$ is as defined in (10).
Hence, $\mathcal{C}'$ can be defined as:

$$\mathcal{C}' = \{H : H = BWB^T,\ W \in \mathcal{K}\}, \qquad (24)$$

where $\mathcal{K}$ is a convex set defined analogously to (12).

Selecting entropy regularization and using empirical estimate for 
optimization, we reduce l l21t to the following problem: 



mm 



1 ^ 



(25) 

where $M = |R|$ is the cardinality of the relation and $\lambda > 0$ is a constant.

Let $\hat{W}$ be the optimal solution to the above problem and let $\hat{H} = B \hat{W} B^T$.
Now, similar to the 1-D case, we bound the expected error
incurred by $\hat{H}$ when compared to the optimal solution $H^* \in \mathcal{C}$
of Problem (21).

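Theorem 4. Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a convex $L_f$-Lipschitz
continuous loss function. Let $\hat{W} = \arg\min_{W \in \mathcal{K}} G(W)$ and $\hat{H} = B \hat{W} B^T$,
and let each $Q_i \sim \mathcal{D}$. Let $\mathcal{K} \subseteq \mathbb{R}^{b_1 \times b_1}$ be a convex set; if
each $W \in \mathcal{K}$ is treated as a $b_1 \times b_1 = b$-dimensional vector, then
$\mathcal{K}$ is selected to be the intersection of the $L_1$ ball of radius and
$L_\infty$ ball of radius If we are given that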



N > 



, A / 6- e- 
then we have

$$F(\hat{H}) = \mathbb{E}_{Q \sim \mathcal{D}}\left[ f(\langle Q, \hat{H} \rangle; s_Q) \right] \le F(H^*) + M\epsilon.$$

See Appendix A.2 of our full version [23] for a detailed proof.

Similar to the 1-dimensional case, we can obtain tighter bounds for
Problem (21) using $L_1$ loss functions, but for ease of implementation
we select the $L_2$ loss. For ease of exposition, we stated our problem
formulation and analysis with equal range r for both attributes
and an equal number of buckets $b_1$ along each dimension. However,
our method can be easily generalized to different range sizes and
bucket sizes along each dimension.

Also, note that to extend from the 1-dimensional case to the
2-dimensional case, we just rewrote the query cardinality estimate as
a linear function of our restricted set of parameters W, i.e.,

$$s_Q = \langle Q, B W B^T \rangle.$$

Similarly, for d dimensions,

$$s_Q = \langle Q, W \times_1 B \times_2 B \cdots \times_d B \rangle,$$

where "$\times_i$" is the i-th mode tensor product and $\langle A, B \rangle$ represents the tensor
inner product of tensors A and B in d dimensions. Hence,



Algorithm 3 EquiHist: Equi-width histogram based method for
Histogram Estimation (d-dimensional case)

1: Input: Training queries $Q_i \in \mathbb{R}^{r \times r \times \cdots \times r}$, $1 \le i \le N$; $s_{Q_i}$: response cardinality for query $Q_i$
2: Parameters: k: number of histogram buckets
3: $W \leftarrow$ solution to (26) (a d-dimensional tensor)
4: $H \leftarrow W \times_1 B \times_2 B \cdots \times_d B$, where B is as given in (10)
5: Output: H (d-dimensional histogram with k buckets)

for d dimensions, the corresponding least squares problem for our
EquiHist method would be:

$$\min_{W} \; \frac{1}{N} \sum_{i=1}^{N} \left( s_{Q_i} - \langle Q_i, W \times_1 B \times_2 B \cdots \times_d B \rangle \right)^2. \qquad (26)$$

See Algorithm 3 for pseudo-code of our general d-dimensional
EquiHist method.
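
As a minimal sketch of the 2-dimensional instance of this least squares problem, the snippet below uses the identity $\langle Q, BWB^T \rangle = \langle B^T Q B, W \rangle$ to turn each query into a $b_1^2$-dimensional feature vector and then solves an ordinary least squares problem. The form of B used here (each bucket's mass spread uniformly over its cells), the helper names, and the omission of the constraint set K and of any regularizer are assumptions made for illustration only.

```python
import numpy as np

def bucket_matrix(r, b1):
    """Assumed form of B (r x b1): column j spreads one unit of mass
    uniformly over the cells of the j-th equi-width bucket."""
    assert r % b1 == 0, "this sketch assumes b1 divides r"
    width = r // b1
    B = np.zeros((r, b1))
    for j in range(b1):
        B[j * width:(j + 1) * width, j] = 1.0 / width
    return B

def fit_equihist_2d(queries, cards, b1):
    """Unconstrained least squares for W in H = B W B^T (2-D case)."""
    r = queries[0].shape[0]
    B = bucket_matrix(r, b1)
    # <Q, B W B^T> = <B^T Q B, W>, so each query becomes one feature row.
    X = np.stack([(B.T @ Q @ B).ravel() for Q in queries])
    w, *_ = np.linalg.lstsq(X, np.asarray(cards, dtype=float), rcond=None)
    W = w.reshape(b1, b1)
    return B @ W @ B.T, W

def random_range_query(r, rng):
    """0/1 indicator matrix of a random axis-aligned rectangle (hypothetical helper)."""
    lo = rng.integers(0, r, size=2)
    hi = np.array([rng.integers(l, r) + 1 for l in lo])
    Q = np.zeros((r, r))
    Q[lo[0]:hi[0], lo[1]:hi[1]] = 1.0
    return Q
```

With this choice of B, the entries of W are the bucket counts, so the fitted histogram sums to the sum of the entries of W.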

3.4.2 Sparse-recovery Approach 



In Subsection 3.3 we introduced a technique for estimating
1-dimensional histograms using sparse-vector recovery techniques.
In this section, we briefly discuss the extension of our approach to
multiple dimensions. Recall that we use the wavelet transform of a
histogram to convert it into a sparse vector, i.e., $\alpha = \Psi h$.
Similarly, any general d-dimensional histogram H can be vectorized,
and the multi-dimensional wavelet transform can then again be
viewed as a linear orthogonal transform. That is, let $h \in \mathbb{R}^{r^d}$ be
an appropriately vectorized version of the histogram $H \in \mathbb{R}^{r \times r \times \cdots \times r}$.
Then, the wavelet coefficients $\alpha^d \in \mathbb{R}^{r^d}$ can be obtained by applying
an orthogonal transform $\Psi^d \in \mathbb{R}^{r^d \times r^d}$.
We omit the details of forming $\Psi^d$ and refer interested readers to [14].

Now, as in the 1-dimensional case, we can show that if there are at
most k cuboidal buckets in the histogram H, then the number of
non-zero wavelet coefficients is at most $O(k r^{d-1} \log r)$. In fact,
in practice the number of non-zero coefficients turns out to be even
smaller. This can be explained by the fact that in practice most
of the data is clustered in small pockets, and hence the number of
non-zero coefficients at the lowest levels is significantly smaller than
the theoretical bounds.

Hence, similar to the 1-dimensional case, our histogram learning
problem is reduced to:

$$\arg\min_{\mathrm{supp}(\alpha^d) \le k} \; \frac{1}{N} \sum_{i=1}^{N} \left( s_{Q_i} - q_i^T (\Psi^d)^T \alpha^d \right)^2, \qquad (27)$$

where $q_i \in \mathbb{R}^{r^d}$ is the vectorized tensor $Q_i$ in d dimensions. Now,
similar to the 1-dimensional case, the sparse wavelet coefficients $\alpha^d$ are
estimated using the Orthogonal Matching Pursuit algorithm, and
the histogram H is then obtained by applying the inverse wavelet transform to $\alpha^d$
and re-arranging the coefficients appropriately (see Algorithm 2).
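
The sparse-recovery pipeline can be sketched for the 1-dimensional case as follows: build an orthonormal Haar matrix, form the design matrix $A = Q\Psi^T$, and greedily select at most k wavelet coefficients with Orthogonal Matching Pursuit. This is a textbook OMP under the assumption that the range r is a power of two; the function names are hypothetical, and it is not claimed to match the paper's Algorithm 2 line for line.

```python
import numpy as np

def haar_matrix(r):
    """Orthonormal Haar transform Psi (r x r); r must be a power of two."""
    assert r >= 1 and (r & (r - 1)) == 0
    Psi = np.array([[1.0]])
    while Psi.shape[0] < r:
        n = Psi.shape[0]
        top = np.kron(Psi, [1.0, 1.0])             # coarser averages
        bottom = np.kron(np.eye(n), [1.0, -1.0])   # finest differences
        Psi = np.vstack([top, bottom]) / np.sqrt(2.0)
    return Psi

def omp(A, s, k):
    """Orthogonal Matching Pursuit: select at most k columns of A greedily,
    refitting all selected coefficients by least squares after each pick."""
    s = np.asarray(s, dtype=float)
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0
    support, coef = [], np.zeros(0)
    residual = s.copy()
    for _ in range(k):
        scores = np.abs(A.T @ residual) / norms
        if support:
            scores[support] = -np.inf   # never pick a column twice
        j = int(np.argmax(scores))
        if scores[j] <= 1e-12:          # residual already explained
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], s, rcond=None)
        residual = s - A[:, support] @ coef
    alpha = np.zeros(A.shape[1])
    if support:
        alpha[support] = coef
    return alpha

def fit_sphist_1d(Q, s, k):
    """Q: N x r matrix of 0/1 range queries, s: observed cardinalities."""
    Psi = haar_matrix(Q.shape[1])
    A = Q @ Psi.T                       # s ~ Q h = Q Psi^T alpha
    alpha = omp(A, s, k)
    return Psi.T @ alpha, alpha         # histogram h and its coefficients
```

Only the k selected coefficients need to be stored; the dense histogram returned here is for inspection.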

Recall that our sparse-recovery method represents histograms by
their corresponding wavelet coefficients $\alpha^d$. Since $\alpha^d$ has only k
non-zero coefficients, the memory footprint of this representation
is small. But for computing the cardinality of an unseen test query, the
time requirement might be large, especially in high dimensions.
However, [24] showed that, for range queries, cardinality estimation
from k non-zero wavelet coefficients can be performed in O(kd)
time using an error tree data structure, at the cost of O(kd) space overhead.



3.5 Dynamic QFRs and Database Updates 

In previous sections, we assumed a static set of input QFRs and a 
static database. We now present extensions to our algorithms that 
relax these assumptions. 

Dynamic QFRs and updates introduce several engineering chal- 
lenges: (1) Do we keep histograms continuously up-to-date as new 
QFRs are available or update them in a batch fashion periodically 
or when system load is low? (2) How and at what level of detail 
is information about updates conveyed to the learning system? A 
comprehensive study of such engineering considerations is beyond 
the scope of this paper. However, we believe that the extensions we 
present below can form a conceptual basis for implementing many 
engineering design choices addressing the questions above. 

Our extensions are based on two ideas: making the learning al- 
gorithms online and modifying the empirical query distribution by 
biasing it towards recent QFRs. 

Online learning: Online learning algorithms [18], at every time
step t, maintain a current histogram $h_t$. In response to a new QFR
$(q_t, s_{q_t})$, they suitably modify $h_t$ to produce $h_{t+1}$. Recall that in
the EquiHist algorithm, a histogram $h_t$ is parametrized by $w_t \in \mathbb{R}^b$
such that $h_t = B w_t$. To update $w_t$ to $w_{t+1}$ in response to a
new QFR $(q_t, s_{q_t})$, we use a well-known strategy called Follow
the Regularized Leader (FTRL) [18]. Formally, the update step is
given by:



Wt+l 



mm 



qfBw) + A|lw|li 



(28) 



where $\lambda > 0$ is an appropriately selected constant; note that
$h_{t+1} = B w_{t+1}$. From Equation (28) it might seem that we are just
"relearning" a histogram from scratch at every time step. However,
we can show that $w_{t+1}$ can be computed from $w_t$ using O(k^)
time (independent of t) by maintaining appropriate data structures.
We can also prove formal guarantees on the error incurred by this
approach using techniques in [18]. We omit these details due to
space considerations.
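
One standard way to obtain a per-step cost that is independent of t can be sketched as follows, assuming a squared loss and an $L_2$ regularizer $\lambda \|w\|_2^2$ (an assumption made for this sketch; the regularizer actually used in the update above may differ). Under that assumption the FTRL minimizer is a ridge regression solution, and it can be maintained with a Sherman-Morrison rank-one update of an inverse matrix. The class name and fields are hypothetical.

```python
import numpy as np

class OnlineEquiHist:
    """Recursive ridge regression: after t updates, self.w minimizes
    sum_{tau<=t} (s_tau - q_tau^T B w)^2 + lam * ||w||_2^2 (assumed objective)."""

    def __init__(self, B, lam=1.0):
        self.B = B
        b = B.shape[1]
        self.P = np.eye(b) / lam   # inverse of (lam * I + sum of x x^T)
        self.c = np.zeros(b)       # running sum of s * x
        self.w = np.zeros(b)

    def update(self, q, s):
        x = self.B.T @ q           # bucket-level features of the query
        Px = self.P @ x
        # Sherman-Morrison rank-one update: O(b^2) work, independent of t.
        self.P -= np.outer(Px, Px) / (1.0 + x @ Px)
        self.c += s * x
        self.w = self.P @ self.c
        return self.w

    def histogram(self):
        return self.B @ self.w     # h_t = B w_t
```

The maintained state is one b x b matrix and one b-vector, so memory does not grow with the number of QFRs seen.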

Similar to EquiHist, SpHist also solves a least squares problem
once it greedily selects a small set S of non-zero wavelet
coefficients. To make SpHist online, we propose modifying S only
infrequently (say every night using all QFRs accumulated that day).
In between these modifications, S remains unchanged and we can
update the current histogram using techniques similar to the ones we
presented above for EquiHist.

Since new QFRs capture changes to workload and data
characteristics, the histogram maintained by online learning algorithms
can adapt to such changes. The online learning algorithms, however,
weigh older QFRs, which might contain outdated information, and
newer QFRs equally. For faster adaptation to changes, it might be
useful to assign a higher weight to recent QFRs; we discuss how to
do this next.

Biasing for recency: Recall that our learning formulation involves
a query distribution $\mathcal{D}$ and that our algorithms approximate $\mathcal{D}$
using an empirical distribution $\hat{\mathcal{D}}$ that assigns an equal probability
to each training sample. To bias for recency, we simply use an
alternate empirical distribution that assigns a higher probability to
recent training QFRs than to older QFRs. The modifications
to our algorithms to incorporate this change are straightforward and
we omit the details.
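
As an illustration of one possible recency bias, the sketch below replaces the uniform empirical distribution with geometrically decaying probabilities and solves the resulting weighted least squares problem. The geometric form, the `decay` parameter, and the function names are assumptions for this sketch; the text above does not prescribe a particular weighting.

```python
import numpy as np

def recency_weights(n, decay=0.99):
    """Probabilities proportional to decay**age; the newest QFR has age 0.
    QFRs are assumed to be ordered from oldest (index 0) to newest."""
    ages = np.arange(n - 1, -1, -1)
    p = decay ** ages
    return p / p.sum()

def weighted_least_squares(X, s, p):
    """Minimize sum_i p_i * (s_i - x_i^T w)^2 by rescaling rows with sqrt(p_i)."""
    root = np.sqrt(p)
    w, *_ = np.linalg.lstsq(X * root[:, None], s * root, rcond=None)
    return w
```

Here each row of X would be the feature vector of one training query (for example $B^T q_i$ for EquiHist), so the same solver can be reused unchanged.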

4 Experiments 

In this section, we empirically evaluate our algorithms (EquiHist
and SpHist) and compare them against ISOMER [21], the current
state of the art in self-tuning histograms. In particular, we compare
our algorithms and ISOMER on the quality of learned histograms
and on various performance parameters, including scalability with
the number of histogram buckets, training data size, dimensionality of
histograms, and size of the attribute domain. We use both real and
synthetic data for our evaluation.

Our experiments involve using one of the algorithms above to
learn a histogram from an input training set of QFRs. We use a
separate test set of QFRs to measure the quality of learned histograms.
In particular, we measure quality using the percentage Average
Relative Error achieved over the test QFRs:



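$$\text{Avg. Rel. Error} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \frac{|s_{q_i} - \hat{s}_{q_i}|}{\max\{100, s_{q_i}\}} \times 100, \qquad (29)$$

where $s_{q_i}$ and $\hat{s}_{q_i}$ denote respectively the actual and estimated
cardinalities of test query $q_i$, and $N_{test}$ denotes the number of test
queries. The same measure is used in ISOMER [21].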
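
A small helper for this metric might look as follows, assuming that the numerator of the measure is the absolute difference between actual and estimated cardinality and that the constant 100 in the denominator acts as a floor for small cardinalities. The function and parameter names are hypothetical.

```python
import numpy as np

def avg_rel_error(actual, estimated, floor=100.0):
    """Percentage average relative error over a set of test queries."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    rel = np.abs(actual - estimated) / np.maximum(floor, actual)
    return 100.0 * rel.mean()
```

For example, actual cardinalities (200, 50) with estimates (150, 60) give relative errors 50/200 and 10/100, i.e. an average of 17.5%.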

In Section 4.1 we discuss details of the data and query workloads
used in our experiments; we also present implementation details of
the algorithms in that section. We present results for 1-dimensional
histograms in Section 4.2 and for multi-dimensional histograms in
Section 4.3. Finally, in Section 4.4 we report results relating to online
learning for dynamic QFRs.

4.1 Data, Workload, and Implementation 

For real-world data, we used the Census dataset from the UCI
Machine Learning Repository [9], also used in STHoles [2]. For
synthetic data, we used the data generator used by STHoles [2]; all
synthetic datasets are essentially mixtures of Gaussians.

We conduct experiments for the one-dimensional case using two
different types of synthetic data and a 1-D projection of the Census data
(see Figure 1(a), (b), (c) for the data distributions of these three
datasets):

• Synthetic Type I: For this dataset, we sampled points from a 
mixture of seventeen Gaussians, each with variance 625 and means 
selected uniformly at random from [0, r]. 

• Synthetic Type II: Here, we sampled from a mixture of five
Gaussians, and the mean of each Gaussian is selected uniformly at
random. The variance is selected to be just 100, leading to a "spiky"
distribution, i.e., most records are concentrated around a small number
of attribute values.

• Census 1-D: We use the Age attribute of the standard Census
dataset with 199,523 database records. Range (r) here is 91.

Similarly, for multi-dimensional histogram experiments, we gen- 
erated synthetic data using multi-dimensional Gaussians and use 
multi-dimensional projections of Census data. That is, 

• Synthetic Multi-D: We generated 2 and 3 dimensional datasets 
for a given range by sampling from a mixture of spherical Gaus- 
sians of corresponding number of dimensions. For 2-dimensional 
datasets, we used a mixture of 9 Gaussians with random means and 
variance equal to 100. For 3-dimensional case we used a mixture of 
5 Gaussians with random means and variance set to 25. The range 
along each attribute was fixed to 32. 

• Census Multi-D: We used the 2-dimensional dataset obtained by 
selecting the "Age" and "Number of Weeks worked" attributes. For 
the 3-dimensional dataset we chose the attributes of "Age", "Mari- 
tal status" and "Education". 

Given the above datasets, we now describe the models to gener- 
ate QFRs used for training and testing learned histograms. We used 
two standard models of range query generation models proposed by 
fTTl and later used by 1 2l: 

• Data-dependent Query Model: In this model, first query "cen- 
ter" is sampled from the underlying data distribution. Then, the 
query is given by a hyper-rectangle whose centroid is given by the 
generated "center" and whose volume is at most 20% of the total 
volume. 
Figure 1: a) Synthetic Type I data distribution, b) "spiky" Synthetic Type II data distribution, c) Census 1-D data distribution, d) Relative error incurred by
various methods as the range of the attribute in the Synthetic Type I dataset varies. Clearly, SpHist and EquiHist scale better with increasing range than ISOMER.
Also, as expected due to Theorem 2, the error increases at a sub-linear rate with increasing range.



• Uniform Query Model: In this model, query "centers" are
selected uniformly at random from the data range. Then, similar to
the above query model, each query is a hyper-rectangle generated
around the "center" with volume at most 20% of the total volume.
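
The two query models and the Synthetic Type I data can be sketched for the 1-dimensional case as below. Several details are assumptions made for illustration and are not stated in the text: equal mixture weights, rounding and clipping of samples to the integer range [0, r), and query widths drawn uniformly between 1 and 20% of the range. All function names are hypothetical.

```python
import numpy as np

def synthetic_type1(n_records, r, rng, n_gauss=17, var=625.0):
    """Mixture of Gaussians with means uniform in [0, r] (equal weights assumed)."""
    means = rng.uniform(0, r, size=n_gauss)
    comp = rng.integers(0, n_gauss, size=n_records)
    x = rng.normal(means[comp], np.sqrt(var))
    return np.clip(np.rint(x), 0, r - 1).astype(int)

def make_query(center, r, max_frac, rng):
    """0/1 indicator of a range around `center` covering at most max_frac of [0, r)."""
    width = rng.integers(1, max(1, int(max_frac * r)) + 1)
    lo = int(np.clip(center - width // 2, 0, r - width))
    q = np.zeros(r)
    q[lo:lo + width] = 1.0
    return q

def uniform_queries(n, r, rng, max_frac=0.2):
    """Uniform Query Model: centers drawn uniformly from the data range."""
    centers = rng.integers(0, r, size=n)
    return np.stack([make_query(c, r, max_frac, rng) for c in centers])

def data_dependent_queries(n, data, r, rng, max_frac=0.2):
    """Data-dependent Query Model: centers drawn from the data itself."""
    centers = rng.choice(data, size=n)
    return np.stack([make_query(c, r, max_frac, rng) for c in centers])
```

Given such queries, the feedback cardinalities follow from the true frequency vector, e.g. `Q @ np.bincount(data, minlength=r)`.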

As mentioned in [2], the above two models are considered to be
fairly realistic and mimic many real-world query workloads. We
generated separate training and test sets (of QFRs) in all the
experiments. In each of the experiments, we evaluated the various methods
using a test set of 5000 QFRs.

Implementation Details: For experiments with one-dimensional
histograms, we implemented both of our methods, EquiHist and
SpHist, as well as ISOMER in MATLAB. For the multi-dimensional
histogram experiments, we modified a C++ implementation of
STHoles [2]. For these experiments, we implemented both ISOMER
and our equi-width approach (EquiHist) in C++.
SpHist was implemented in MATLAB. For each experiment, we
report numbers averaged over 10 runs.

For solving the max-entropy problem in ISOMER, we use an
iterative solver based on Bregman's method [6]. We found Bregman's
method for solving the max-entropy problem to be significantly
faster than the Iterative Scaling method used by [21].

4.2 Results for One-Dimensional Histograms 

We now present results for 1-D histograms and study how the per- 
formance varies under different conditions: first, as the number of 
training queries increases, second, as the number of buckets in the 
histogram being learnt increases, and finally, as the range of at- 
tribute value increases. 
Varying Number of Training Queries: 

We first compare our EquiHist and SpHist methods with ISOMER
for a varying number of training queries. Figure 2(a) compares the
relative error incurred on test queries by the three methods on the
Synthetic Type I dataset for queries generated from the Uniform Query
Model. Here, we vary the number of training queries from 25 to
700, while the range r of attribute values is fixed to be 1024 and
the number of buckets in the histogram is fixed to be 20. Naturally,
the error incurred by each of the methods decreases with an increasing
number of training queries. However, both of our methods are
able to decrease the relative error more rapidly. For example, at
around 200 queries, the error converges for both EquiHist and SpHist.
In contrast, the error incurred by ISOMER decreases slowly and
oscillates; the primary reason is that in the final round ISOMER uses only
(approximately) twice as many queries as the number of buckets
(20) and hence over-fits in some runs. Furthermore, even with 700
queries, our SpHist method is 1.4% more accurate than ISOMER,
while EquiHist is 0.5% more accurate.

Next, we compare the three methods on the Synthetic Type I dataset
with queries generated from the Data-dependent Query Model (see
Figure 2(b)). Here again, both EquiHist and SpHist require only
300 training queries to converge, and are about 0.3% and 1.6%
more accurate than ISOMER, respectively.
Figure 3: Histograms learned by ISOMER (top plot), EquiHist
(middle plot) and SpHist (bottom plot) for the "spiky" Synthetic Type II
dataset with 700 data-dependent queries and 20 buckets. Clearly, bucket
boundaries discovered by ISOMER do not align well with the peaks of the
true frequency distribution (see Figure 2(c)), leading to high error. EquiHist
is constrained to partition the range at equal intervals, hence is mis-aligned with
several peaks. In contrast, SpHist is able to accurately align bucket
boundaries to the true frequency distribution, hence incurs less test error.

In Figure 2(c) we compare performance on the spiky Synthetic
Type II dataset with queries generated from the Data-dependent Query
Model. For this experiment, all three methods converge at
about 300 queries. However, SpHist is significantly more accurate
than both ISOMER and EquiHist. Specifically, SpHist incurs only
1.37% error, while EquiHist incurs 7.85% error and ISOMER
incurs 26.87% error. EquiHist is naturally somewhat inaccurate, as
its bin boundaries will typically be much wider than the optimal
histogram's boundaries. Interestingly, SpHist is able to learn
correct bucket boundaries with a small number of training queries and
hence provides a histogram very similar to the underlying
distribution. Figure 3 shows the histograms recovered by the different
methods overlaid on the true frequency distribution. We observe
that SpHist is able to align bin boundaries accurately with respect
to the true distribution. In comparison, ISOMER's and EquiHist's
buckets are not as well aligned, leading to higher test errors.

In Figure 2(d) we compare the three methods on the Census 1-D
dataset with queries generated from the Uniform Query Model. As in
the previous case, SpHist incurs less error than both ISOMER and
EquiHist (1.5% and 1.0% less error, respectively), and is able to
learn from a smaller number of training queries.



[Figure 2 panel titles: (a) Synthetic Type I, Uniform Query Model, No. of Buckets = 20, Range = 1024; (b) Synthetic Type I, Data-dependent Query Model, No. of Buckets = 20, Range = 1024; (c) Synthetic Type II ("Spiky" Data), Data-dependent Query Model, No. of Buckets = 20, Range = 1024; (d) Census 1-D ("Age" Attribute), Uniform Query Model, No. of Buckets = 8, Range = 128]

Figure 2: Comparison of average relative error (on log-scale) with varying number of training queries. (a) Test error for the Synthetic Type I dataset with queries
generated from the Uniform Query Model. Both EquiHist and SpHist converge at around 200 training queries, and incur 0.5% and 1.4% less error than
ISOMER, respectively, for 700 training queries. (b) Test error on the Synthetic Type I dataset with queries generated from the Data-dependent Query Model. Here
again, both EquiHist and SpHist converge at around 200 training queries and finally obtain 0.3% and 1.6% less error than ISOMER. (c) Test error on the
"spiky" Synthetic Type II dataset with queries from the Data-dependent Query Model. For 700 training queries, ISOMER incurs 26.87% error, while SpHist
incurs only 1.37% error. (d) Test error on the Age attribute of the Census 1-D data with queries generated from the Uniform Query Model. For 200 queries, EquiHist
incurs approximately 0.5% less error than ISOMER, while SpHist incurs 1.5% less error.



[Figure 4 panel titles: (a) Synthetic Type I, Uniform Query Model, No. of training queries = 400, Range = 1024; (b) Synthetic Type I, Data-dependent Query Model, No. of training queries = 400, Range = 1024; (c) Synthetic Type II ("Spiky" Data), Data-dependent Query Model, No. of training queries = 400, Range = 1024; (d) Census 1-D ("Age" Attribute), Uniform Query Model]

Figure 4: Comparison of average relative error (on log-scale) with varying number of histogram buckets. (a) Test error on the Synthetic Type I dataset with
queries generated from the Uniform Query Model. For most values, both EquiHist and SpHist are significantly more accurate than ISOMER. As expected, for a
small number of buckets, SpHist is more accurate than EquiHist, while for a large number of buckets, EquiHist incurs less error. In particular, for 10 buckets,
SpHist is 18% more accurate than ISOMER and 6% more accurate than EquiHist. (b) Test error on the Synthetic Type I dataset with queries generated from the
Data-dependent Query Model. Here again, SpHist is around 5% more accurate than EquiHist for 10 buckets and is around 0.3% less accurate than ISOMER.
(c) Test error on the Synthetic Type II dataset with queries from the Data-dependent Query Model. Here, SpHist incurs significantly less error than both ISOMER
(by 21%) and EquiHist (by 6%) for 10 buckets. (d) Test error on the Census 1-D dataset with queries generated from the Uniform Query Model. For 5 buckets,
SpHist incurs about 2% less error than EquiHist and 20% less error than ISOMER.



Varying Number of Buckets: 

Here, we study our methods as the number of buckets in the histograms
varies. First, we consider the Synthetic Type I dataset and vary the number
of buckets from 10 to 100, while the range and number of training
queries are fixed to be 1024 and 400, respectively. Figure 4(a)
shows the error incurred by the three methods for a varying number of
buckets when queries are generated using the Uniform Query Model.
Clearly, for a small number of buckets, SpHist achieves significantly
better error rates than EquiHist and ISOMER. Specifically, for 10
buckets, SpHist incurs 8.3% error, while EquiHist incurs 14.09%
error and ISOMER incurs 26.84% error. However, EquiHist
performs better than both ISOMER and SpHist as the number of buckets
increases. Figure 4(b) shows a similar trend when queries are
generated from the Data-dependent Query Model. Here, interestingly, for
10 buckets, ISOMER performs significantly better than EquiHist
and performs similarly to SpHist. However, with a larger number of
buckets ISOMER converges to significantly higher error than both
EquiHist and SpHist.

Next, we consider the Synthetic Type II dataset with the
Data-dependent Query Model and vary the number of buckets from 10 to
1000. Figure 4(c) compares the test error incurred by the three
methods. Here again, SpHist performs best of the three methods. In
particular, for 10 buckets, SpHist incurs 5.48% error while
EquiHist incurs 12.68% error and ISOMER incurs 26.66% error.

Finally, we consider Census 1-D data with queries drawn from
the Uniform Query Model. We vary the number of buckets from 5
to 50; note that the range of the Age attribute is only 91. Similar to
the above experiments, SpHist incurs significantly less error than
ISOMER (by 20%) and EquiHist (by 2%).



Varying Range of Attribute Values: 

In the next experiment, we study the performance of the different
methods for a varying range of attribute values. Here, we use the Synthetic
Type I dataset with the Data-dependent Query Model, fix the
number of buckets to 15, and fix the number of training queries to
200. Figure 1(d) compares the error incurred by SpHist and
EquiHist to that of ISOMER. Here again, our methods are
significantly better than ISOMER. Also, as predicted by our
theoretical results (see Theorem 2), EquiHist does not depend heavily on
the range and is able to learn low-error histograms with a small
number of queries. Similar trends were observed for the Uniform Query
Model and Synthetic Type II data as well.

We summarize our results for 1-dimensional histogram settings 
as follows: 

• Both EquiHist and SpHist converge quicker than ISOMER with
respect to the number of queries and in general incur less error for all
numbers of training queries.

• SpHist incurs significantly less error than EquiHist and ISOMER 
for "spiky" data (Synthetic Type II dataset). 

• EquiHist and SpHist consistently outperform ISOMER with vary- 
ing number of buckets. 

• SpHist demonstrates a clear advantage over the other two meth- 
ods for smaller number of buckets. For larger number of buckets, 
EquiHist incurs less error than SpHist. 

• Both EquiHist and SpHist scale well with increasing range of the 
attribute values. 
[Figure 5 panel titles: No. of Buckets = 64, Range = 128; Census 2-D, Data-dependent Query Model, No. of Buckets = 64, Range = 128]

Figure 5: Comparison of average relative error (on log-scale) for various methods on two-dimensional datasets with queries generated from the Data-dependent
Query Model. (a) Test error on the Synthetic 2-D dataset with varying number of training queries. For 1200 queries, both SpHist and EquiHist incur about 26%
less error than ISOMER. (b) Test error for the Synthetic 2-D dataset with varying number of buckets. For 16 buckets, SpHist incurs 13.95% error while EquiHist
incurs 26.68% error and ISOMER incurs 33.99% error. (c) Test error for Census 2-D data with varying number of training queries. For 1200 queries, SpHist
incurs 4.64% error, while EquiHist incurs 109.54% error and ISOMER incurs 35.55% error. EquiHist incurs more error than both SpHist and ISOMER due
to the "spikiness" of the data (see Figure 7(a)). (d) Test error for Census 2-D data with varying number of histogram buckets. For 16 buckets, SpHist incurs
10.21% error while ISOMER incurs 66.34% error and EquiHist incurs 148.7% error. Similar to plot (b), EquiHist incurs larger error due to heavily skewed
data (see Figure 7(a)).
Figure 6: Comparison of average relative error (on log-scale) for various methods on three-dimensional datasets with queries generated from the Data-dependent
Query Model. a), b) Test error on the Synthetic 3-D dataset with varying number of training queries and varying number of buckets. Here, both EquiHist and
SpHist are able to learn reasonable histograms (incurring about 10% error) and follow similar trends to the two-dimensional experiments (see Figure 5). We do not
report test error for ISOMER, as our implementation of ISOMER did not finish even after two days. c), d) Test error on the Census 3-D dataset with varying
number of training queries and number of buckets. Similar to the Census 2-D dataset, SpHist incurs significantly less error than both EquiHist and ISOMER. For
example, in plot (c), for 1500 queries, SpHist incurs about 11% less error than ISOMER and 30% less error than EquiHist.



4.3 Multi-dimensional Histograms 

In this section, we empirically compare our EquiHist and SpHist
methods with ISOMER for learning multi-dimensional histograms.
For these experiments also, we use synthetic as well as Census data.
We use the Data-dependent Query Model for all the
multi-dimensional histogram experiments.

Experiments with 2-dimensional Datasets: We first compare our
methods to ISOMER for a varying number of training queries on the
Synthetic 2-D dataset. Figure 5(a) shows the test error obtained
by all three methods for different numbers of training queries,
where queries are generated using the Data-dependent Query Model
and the number of histogram buckets is fixed to be 64. Clearly, our
methods outperform ISOMER and are able to reduce error rapidly
with an increasing number of training queries. For example, for 1000
training queries, both SpHist and EquiHist incur about 6.0% error
while ISOMER incurs 31% error.

Next, we compare the three methods on the Synthetic 2-D dataset
while varying the number of buckets, with the number of training queries
fixed at 2000 (see Figure 5(b)). Here again, both EquiHist and
SpHist outperform ISOMER for a small number of buckets. For 128
buckets, ISOMER incurs 10.36% error while EquiHist and SpHist
incur around 2.91% and 3.83% error, respectively.

In the next set of experiments, we compare the performance of
the three algorithms on the real-world Census 2-D dataset (see Figure 5(c)).
Recall that the Census 2-D dataset projects Census data on the "Age"
and "Number of Weeks Worked" attributes. Now, for most database
records, "Number of Weeks Worked" is concentrated around either
0 or 53 weeks. That is, the data is extremely "spiky". However,
EquiHist still tries to approximate the entire space using equi-width
buckets, leading to several "empty" buckets. Consequently,
EquiHist incurs large error (109% for 1200 queries), while ISOMER
also incurs 35.55% error. However, SpHist is still able to
approximate the underlying distribution well and incurs only 4.64% error.
Figure 7(b) shows the histogram estimated by SpHist with 1000
queries and 64 buckets. Clearly, SpHist is able to capture
high-density regions well; there are small peaks in low-density areas
which contribute to the error that SpHist incurs.

Finally, we report the test error incurred by the three methods on the
Census 2-D dataset in Figure 5(d), as we vary the number of buckets
from 5 to 2048 while the number of training queries is fixed at
2000. For this data as well, SpHist outperforms both EquiHist
and ISOMER significantly, especially for a small number of buckets.
Specifically, for 16 buckets, SpHist incurs only 10.21% relative
error while EquiHist and ISOMER incur about 148.7% and 66.34%
relative error, respectively.

3-dimensional Datasets: We now consider 3-dimensional datasets 
to study scalability of our methods with increasing dimensions. 

We first conduct experiments on the Synthetic 3-D dataset, which
is drawn from a mixture of spherical 3-D Gaussians. Figure 6(a)
shows the relative error incurred by our methods when queries are
generated from the Data-dependent Query Model and the number
of training queries varies, while the number of buckets is fixed at 216.
For 2000 queries, SpHist incurs 5.59% error while EquiHist
incurs 8.39% error. Note that we are not able to report the error incurred
by ISOMER, as our implementation of ISOMER did not terminate
even after running for two days. The primary reason is that, even in 3
dimensions, the number of variables in ISOMER's max-entropy
problem becomes large; e.g., for our experiments with 1000 queries
ISOMER had 27,000 variables in the first round. Furthermore,
ISOMER needs to iteratively re-optimize after throwing out a fixed
number of queries, thus further increasing run-times. In contrast,
both of our methods are very efficient and need to run the
optimization only once. We observed that in all our experiments, our
methods terminated in less than 1 minute (see Figure 7(c)).

Figure 7: a) Frequency distribution of the "Age" and "Number of Weeks Worked" attributes of Census 2-D data. b) Estimated histogram for Census 2-D data
using the SpHist method with 1000 training queries and 64 buckets (see plot (a) for the true distribution). SpHist is able to recover the underlying distribution well
using a small number of buckets and training queries. In particular, the high-density areas for "Weeks Worked" attribute values 0 and 52 are captured accurately.
c) Running times of SpHist and EquiHist for the Synthetic 3-D dataset (on a dual-core 2GHz processor with 4GB RAM). Note that our methods mostly finish
within 30 seconds, while on the same experiment our implementation of ISOMER did not finish in 2 days. d) Streaming queries and database updates experiment.
Test error incurred by the online version of EquiHist as well as the batch version for the Synthetic Type I dataset. At each step, one training query is provided; at the 1000-th
step the database is updated by randomly permuting 30% of the database. EquiHist (1-1000) represents the batch EquiHist histogram obtained by training on the first
1000 queries, while EquiHist (1001-2000) trains on the batch of the 1001-th to 2000-th queries.

Next, we vary the number of buckets while the number of training queries
is fixed to be 2000. Figure 6(b) shows the relative error incurred by
our methods. Clearly, for a small number of buckets SpHist is
significantly better than EquiHist. For example, for 16 buckets SpHist
incurs 19.51% error while EquiHist incurs 57.36% error.

Finally, we repeat these experiments on our Census 3-D dataset
(which considers "Age", "Marital Status" and "Education" as
the 3 attributes), i.e., we vary the number of training queries (Figure 6(c))
as well as the number of buckets (Figure 6(d)). Overall, we
observe the same trend as in the 2-dimensional case, with SpHist being most
accurate and EquiHist being most inaccurate; e.g., for 200 buckets
SpHist incurs 7% error while ISOMER incurs 20% error and
EquiHist incurs 38% error.

4.4 Dynamic QFRs and data updates 

We now study the performance of online version of EquiHist (Sec- 
tion [5311 for dynamic QFRs and in the presence of data updates. 
(The performance of online SpHist is similar when the set S of
non-zero wavelet coefficients is kept unchanged.) The goal is to
show that our online updates are effective, converge to the optimal
batch solution quickly, and are robust to database updates.

For this experiment, we consider the Synthetic Type I dataset with
queries generated from the Uniform Query Model. We compare
our online version of EquiHist against the batch version of
EquiHist. After each update of the online version, we measure the
relative error on 5000 test queries. We also measure the relative
test error incurred by the batch EquiHist method (which can observe
all the training queries beforehand). Figure 7(d) compares the
relative error incurred by online EquiHist to that of the batch
method. For steps 1 to 1000, the database remains the same. Clearly,
the online learning algorithm quickly (in around 250 steps) converges
to a relative error similar to that of the batch method (red line)
with 1000 training queries.

Now, at the 1000-th step, we update our database by randomly
perturbing 30% of the database. This leads to larger error for online
EquiHist for a few steps after the 1000-th step (see Figure 7(d));
however, it quickly converges to the optimal batch solution (blue line)
for the updated database (using queries 1001 to 2000).



4.5 Comparison to ISOMER 

Here, we summarize our experimental results and discuss some of
the advantages that our methods enjoy over ISOMER.

• ISOMER reduces the number of histogram buckets by removing
queries. Hence, for a small number of buckets, ISOMER removes
almost all the queries and ends up over-fitting to a few queries,
leading to high relative error. In contrast, our methods use all the
queries and hence are able to generalize significantly better.

• ISOMER has to re-run the max-entropy solver after each
query-reduction step; hence the time required by ISOMER is
significantly larger than that of our methods.

• In the multi-dimensional case, ISOMER's data structure forms a
large number of buckets even though the final number of buckets
required is small. For this reason, ISOMER doesn't scale well to
high dimensions; in contrast, our methods fix the number of buckets
a priori and hence scale fairly well to high dimensions.

• Database updates lead to inconsistent QFRs, which need
to be thrown away by ISOMER in a heuristic manner. In comparison,
our methods easily extend to database updates.

5 Related Work 

Simplicity and efficiency of histograms have made them the data
structure of choice for summarizing information for cardinality
estimation, a critical component of query optimization. Consequently,
histograms are used in most commercial database systems. Most of
the prior work on histograms has focused on constructing and
maintaining histograms from data, and we refer the reader to [11] for a
survey.

Self-tuning histograms were first introduced by [1] to exploit
workload information for more accurate cardinality estimation. The
method proposed in [1] is typically referred to as STGrid, and it
merges and splits buckets according to bin densities. However, it
does not have any known provable bounds on the expected relative
error and is restricted by the grid structure in high-dimensional
cases. [3] introduced the STHoles data structure, which is significantly
more powerful than the simple grid structure used by STGrid. However,
this method requires fine-grained query feedback corresponding
to each bucket in the current histogram and therefore imposes
a nontrivial overhead while collecting feedback. Furthermore, if
the number of queries is small then STHoles is not "consistent",
i.e., different ways of constructing the data structure can lead to
different query cardinality estimates. This consistency problem
was addressed by [21], who designed a maximum-entropy-based
method called ISOMER. ISOMER uses a data structure similar
to STHoles, but learns frequency values in each bucket using a
maximum-entropy solution. Note that ISOMER is considered to
be the state-of-the-art method for self-tuning histograms [8], and
hence we compare against it both theoretically and empirically.
Recent work [18, 19] has extended the maximum-entropy approach
to handle feedback involving distinct values. [13] also present a
histogram construction that is similar to EquiHist, but their
algorithm only handles 1-dimensional histograms. Self-tuning
histograms are part of a larger effort that seeks to leverage
execution feedback for query optimization [15].

Wavelets are a popular signal-processing tool for compressing
signals. Haar wavelets are among the simplest and most popular
wavelets and are especially effective for piecewise constant
signals. As histograms are piecewise constant signals, Haar wavelets
are extensively used in the context of databases. Specifically, [16]
introduced a wavelet-based histogram that can be used for
selectivity estimation. Similarly, [7] also defined a method for
selectivity estimation using wavelets. [10] introduced a probabilistic
method to decide the wavelet coefficients to be used and provides
error guarantees for the same. However, most of these methods compute
wavelet coefficients using a complete scan of the database and are
not self-tuning. In contrast, we introduce a method that uses
sparse-vector recovery techniques to learn appropriate wavelet
coefficients for estimating self-tuning histograms.
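To make the link between piecewise constant signals and sparsity concrete, the following sketch computes an orthonormal Haar transform of a small histogram and counts its non-zero coefficients. The function names and the example histogram are illustrative assumptions; this is the textbook transform, not the specific construction used by SpHist.

```python
import math

def haar_transform(signal):
    """Orthonormal Haar wavelet transform of a list whose length is a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients capture differences, approximations capture sums.
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = detail + details      # coarser levels go in front
    return approx + details

def count_nonzero(coeffs, tol=1e-9):
    """Number of coefficients whose magnitude exceeds tol."""
    return sum(1 for c in coeffs if abs(c) > tol)

# A 3-bucket histogram over 16 attribute values (hypothetical heights).
hist = [5.0] * 4 + [1.0] * 8 + [9.0] * 4
coeffs = haar_transform(hist)
print(count_nonzero(coeffs), "of", len(coeffs), "coefficients are non-zero")
```

For this histogram only 4 of the 16 Haar coefficients are non-zero, so the 16 heights are fully described by a sparse coefficient vector.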

6 Conclusions 

In this paper, we introduced a learning theoretic framework for
the problem of self-tuning histograms. We cast the problem in an
empirical loss minimization framework and proposed two different
approaches in this framework. Our first approach (EquiHist)
efficiently learns well-known equi-width histograms. We also showed
that the equi-width approach, despite its limitations, can still solve
our histogram estimation problem up to an additive approximation
factor while requiring only a finite number of training queries. To
the best of our knowledge, this is the first theoretical guarantee for
equi-width histograms in the context of self-tuning histograms.

However, in high dimensions where data is sparse, or for "spiky"
datasets, the equi-width approach suffers as it wastes many buckets on
empty regions. Our second approach (SpHist) handles this problem:
by using the Haar wavelet transform, we cast the problem
as that of learning a sparse vector. We then adapt the popular
Orthogonal Matching Pursuit (OMP) method [22] for solving the
transformed problem.
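For readers unfamiliar with OMP, the sketch below shows the textbook greedy procedure on a tiny example: repeatedly pick the column most correlated with the residual, then refit all selected coefficients by least squares. It is a generic illustration under stated assumptions (dense Python lists, a fixed sparsity level, normal equations solved by Gauss-Jordan elimination) and is not the adapted variant used by SpHist.

```python
def solve_least_squares(cols, y):
    """Least-squares coefficients for y ~ sum_j c_j * cols[j] via normal equations."""
    k = len(cols)
    # Augmented matrix [A^T A | A^T y].
    m = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(a * b for a, b in zip(cols[i], y))] for i in range(k)]
    for i in range(k):                       # Gauss-Jordan with partial pivoting
        p = max(range(i, k), key=lambda row: abs(m[row][i]))
        m[i], m[p] = m[p], m[i]
        for row in range(k):
            if row != i:
                f = m[row][i] / m[i][i]
                m[row] = [u - f * v for u, v in zip(m[row], m[i])]
    return [m[i][k] / m[i][i] for i in range(k)]

def omp(columns, y, sparsity):
    """Orthogonal Matching Pursuit: greedily select `sparsity` columns."""
    support, coef = [], []
    residual = list(y)
    for _ in range(sparsity):
        def score(j):
            # Normalized correlation of column j with the current residual.
            c = columns[j]
            dot = sum(a * b for a, b in zip(c, residual))
            return abs(dot) / sum(a * a for a in c) ** 0.5
        best = max((j for j in range(len(columns)) if j not in support), key=score)
        support.append(best)
        chosen = [columns[i] for i in support]
        coef = solve_least_squares(chosen, y)  # refit on the whole support
        residual = [yi - sum(c * col[t] for c, col in zip(coef, chosen))
                    for t, yi in enumerate(y)]
    x = [0.0] * len(columns)
    for j, c in zip(support, coef):
        x[j] = c
    return x

# Unnormalized Haar basis on 4 points; y = 2 * column 0 + 3 * column 3.
columns = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 0, 0], [0, 0, 1, -1]]
print(omp(columns, [2, 2, 5, -1], sparsity=2))
```

In this example the procedure selects column 3 first and column 0 second, recovering the two non-zero coefficients.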

Both of our techniques can be easily extended to multi-dimensional
settings, dynamic QFRs and database updates. To demonstrate the
effectiveness of our methods in all these scenarios, we provide a
variety of empirical results. Overall, our empirical results show that
SpHist is consistently better than EquiHist as well as ISOMER,
especially for multi-dimensional datasets and for a small number
of buckets, both of which are critical parameters for real-world
databases. For example, SpHist is able to recover the true
distribution reasonably well for the Census 2-D dataset (see Figure 7(b)).

For future work, we intend to work on theoretical analysis of 
SpHist. Another interesting direction is further study of the multi- 
dimensional buckets output by SpHist, so as to further improve its 
efficiency. Finally, we intend to apply our techniques to real-world 
query workloads. 
APPENDIX

A Proofs of Equi-width Approach

A.1 Proof of Theorem 2

Proof. Now,

\[ F(\hat{h}) - F(h^*) = F(\hat{h}) - F(\tilde{h}) + F(\tilde{h}) - F(h^*) \tag{30} \]
\[ = E_1 + E_2, \tag{31} \]

where $\hat{h} \in C'$ is given by $\hat{h} = B\hat{w}$, $\hat{w}$ being the optimal
solution to the EquiHist optimization problem, $E_1 = F(\hat{h}) - F(\tilde{h})$ is the
excess generalization error incurred by $\hat{h}$ compared to the optimal solution
$\tilde{h}$, and $E_2 = F(\tilde{h}) - F(h^*)$ is the difference between the optimal error
achievable by histograms in $C'$ and by histograms in $C$. Intuitively, $E_2$ measures
how expressive the set $C'$ is w.r.t. $C$. Below we bound both errors individually and
finally combine the two to obtain an error bound on $F(\hat{h}) - F(h^*)$.

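Bound on $E_1$: Let $\tilde{h} = B\tilde{w}$, where $\tilde{w}$ minimizes $G_D(\cdot)$. Now,
since $G_D(\cdot)$ is the regularized expected risk under a convex risk function
$f(q^T B w; s_q)$, we can use standard results from stochastic convex optimization to
bound the generalization error. In particular, using Theorem 1 of [20],
with probability $1 - \delta$:

\[ E_{q \sim D}[f(q^T B\hat{w}; s_q)] - \lambda H(\tfrac{1}{M}\hat{w}) \le E_{q \sim D}[f(q^T B\tilde{w}; s_q)] - \lambda H(\tfrac{1}{M}\tilde{w}) + O\left(\frac{\Omega^2 L_f^2 M^2\, b \log\frac{1}{\delta}}{r \Delta \lambda N}\right), \tag{32} \]

where:

• $\Omega = \max_{q \sim D} \|B^T q\|_2$. Assuming each query covers only
$|Q|$ attribute values, $\Omega \le |Q|$.

• $L_f$ is the Lipschitz constant of the function $f(u; s_q)$ w.r.t. $u$.
Note that, as the domain is a compact set, $L_f \le \max_{|u| \le \Omega \|w\|_2} \|\nabla f(u)\|_2$.

• $\lambda > 0$ is a constant. Also, note that the entropy function $H(\tfrac{1}{M} w)$
is $\tfrac{1}{M^2}$-strongly convex.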

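Now, $H(\tfrac{1}{M} w) \le \log b$. Hence, using (32),

\[ F(\hat{w}) - F(\tilde{w}) \le \lambda \log b + O\left(\frac{\Omega^2 L_f^2 M^2\, b \log\frac{1}{\delta}}{r \Delta \lambda N}\right) \le O\left(\sqrt{\frac{b \log b \log\frac{1}{\delta}}{r \Delta N}}\; M |Q| L_f\right), \tag{33} \]

where the second inequality follows by selecting

\[ \lambda = \sqrt{\frac{b \log\frac{1}{\delta}}{r \log b\, \Delta N}}\; M |Q| L_f. \]

That is,

\[ E_1 = F(\hat{h}) - F(\tilde{h}) = F(\hat{w}) - F(\tilde{w}) \le O\left(\sqrt{\frac{b \log b \log\frac{1}{\delta}}{r \Delta N}}\; M |Q| L_f\right). \tag{34} \]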
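The choice of the regularization weight above balances a term that grows linearly in $\lambda$ against one that decays as $1/\lambda$. The sketch below checks this trade-off numerically for a generic bound of the form $\lambda c + a/\lambda$; here `c` stands in for $\log b$ and `a` for the collected constants of the second term, and both numeric values are illustrative assumptions.

```python
import math

def tradeoff(lam, c, a):
    """Value of the bound lam * c + a / lam for a given regularization weight."""
    return lam * c + a / lam

def best_lambda(c, a):
    """Minimizer of lam * c + a / lam over lam > 0 (set the derivative to zero)."""
    return math.sqrt(a / c)

c = math.log(64)   # plays the role of log b with b = 64 buckets
a = 0.5            # hypothetical value for the constants of the 1/lambda term
lam = best_lambda(c, a)
print(lam, tradeoff(lam, c, a), 2 * math.sqrt(a * c))
```

At the minimizer both terms are equal, and the bound takes the value $2\sqrt{ac}$, which is the square-root form that appears after substituting $\lambda$.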

Bound on $E_2$: To bound $E_2$, we first observe that
$F(\tilde{h}) \le F(h) + \lambda \log b, \ \forall h \in C'$.

Hence,

\[ E_2 \le F(h) - F(h^*) + E_1, \quad \forall h \in C'. \tag{35} \]

Now, given any $h^*$ we construct a new vector $\bar{h}^* \in C'$ for which
we can bound the error $F(\bar{h}^*) - F(h^*)$.



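Now, $\bar{h}^*$ has $b$ buckets, and the value in each bucket (i.e.,
$\bar{w}^*_j$, $1 \le j \le b$) is the average of the histogram heights of $h^*$ in that
particular bucket. Formally,

\[ \bar{w}^*_j = \frac{1}{r/b} \sum_{i = (j-1)\frac{r}{b} + 1}^{j \frac{r}{b}} h^*_i. \tag{36} \]

See Figure 8 for an illustration of our conversion scheme from $h^*$
to $\bar{h}^*$.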




Figure 8: Conversion of $h^*$ to $\bar{h}^*$. The top figure shows $h^*$ while the bottom
one shows $\bar{h}^*$. Note that buckets 1, 3, 4, 6, 7 of $\bar{h}^*$ lie within buckets of $h^*$,
hence we assign the same heights to them as the corresponding buckets in $h^*$.
Buckets 2 and 5 of $\bar{h}^*$ are at the intersection of buckets 1 and 2, and 2 and 3, of
$h^*$, respectively; hence a convex combination of the heights of those buckets is
assigned to buckets 2 and 5 of $\bar{h}^*$.
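The conversion scheme (averaging the heights of a general histogram within each equi-width bucket) is easy to check on a small example. The sketch below is illustrative: the function names and the 12-value, 3-bucket histogram are assumptions chosen so that the equi-width bucket width (2) is no larger than the smallest original bucket (3).

```python
def to_equi_width(h, b):
    """Replace h by its average over each of b equal-width buckets."""
    r = len(h)
    if r % b:
        raise ValueError("number of buckets must divide the domain size")
    width = r // b
    out = []
    for j in range(b):
        chunk = h[j * width:(j + 1) * width]
        out.extend([sum(chunk) / width] * width)
    return out

def changed_buckets(h, h_avg, b):
    """Indices of equi-width buckets on which averaging changed the heights."""
    width = len(h) // b
    return [j for j in range(b)
            if h[j * width:(j + 1) * width] != h_avg[j * width:(j + 1) * width]]

# Histogram with k = 3 buckets of widths 5, 4, 3 over r = 12 attribute values.
h_star = [6] * 5 + [0] * 4 + [3] * 3
h_bar = to_equi_width(h_star, 6)
print(h_bar)
print(sum(h_bar), changed_buckets(h_star, h_bar, 6))
```

The total mass is preserved (39 records before and after), and only the two equi-width buckets that straddle a boundary of the original histogram change, in line with the "at most k - 1 buckets" observation.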



Now, note that

\[ \sum_i \bar{h}^*_i = \sum_i h^*_i = M. \]

Furthermore, assuming $r/b \le \Delta$ (i.e., the width of buckets in $\bar{h}^*$ is
smaller than the smallest bucket of $h^*$), the average is over at most two
different heights of $h^*$ in each bucket of $\bar{h}^*$. Since there are only
$k$ buckets in $h^*$, at most $k - 1$ buckets in $\bar{h}^*$ are at the intersection
of buckets in $h^*$. Let $I$ be the set of all the buckets in $\bar{h}^*$ that
are at the intersection of buckets in $h^*$. Now consider $\|h^* - \bar{h}^*\|_2^2$.
Clearly, if a bucket in $\bar{h}^*$ does not lie in $I$, then it does not contribute
to $\|h^* - \bar{h}^*\|_2^2$, as the value of attributes in that bucket is the same as the
value of attributes in the corresponding bucket of $h^*$. Thus,



(37) 



where the last inequality follows using the Cauchy-Schwarz inequality. Now,
$\|h^*\|_\infty \le M/\Delta$, as each bucket has width at least $\Delta$ and there are at
most $M$ records to fill one bucket. Similarly,
$\sum_{i \in I} h^*_i \le \|h^*\|_1 = M$.
Hence, using (37) with the above observations,



llh* 



M 



(38) 



Now, we bound F(h*) — F{h*): 

F(h*) - F{h') = Eq.i,[/(q^h*; s,) - /(q^r ; s,)], 

< Eq^i,[L/|q^h* -q^h*|], 

< Eq^i,[Lj||q||2||h* ^h'lla], 

, M fr 

< V\Q\Lf ^Jl, (39) 



where the last inequality follows from the fact that $q$ is over $|Q|$
attributes only and using (38).
Now using (39) and (35):



\Q\L 



M_ 



(40) 



Hence, by combining (31), (34), (40):

M 



F(h) - F(h*) < 




where last inequality follows by selecting N using: 



iV > Ci ( ^ 



(41) 



(42) 



where $C_1 > 0$ is a constant. Now, the second inequality in (41) follows
by selecting $b$ using:



C2r 



IQ|A 



1/2 



log I 



(43) 



where last inequality follows using fc > as A is the minimum 
bucket width and C2 > is a constant. 
Hence proved. □ 

A.2 Proof of Theorem 3

Proof. Similar to the proof of Theorem 2,

\[ F(\hat{H}) - F(H^*) = F(\hat{H}) - F(\tilde{H}) + F(\tilde{H}) - F(H^*) \tag{44} \]
\[ = E_1 + E_2, \tag{45} \]

where $\hat{H} \in C'$ is given by $\hat{H} = B\hat{W}B^T$.
Recall that

\[ G_D(W) = E_{Q \sim D}[f(\langle Q, BWB^T \rangle; s_Q)] - \lambda H(\tfrac{1}{M} W), \tag{46} \]

and $\tilde{W}$ is the optimal solution to (46), with $\tilde{H} = B\tilde{W}B^T$. Also,
$E_1 = F(\hat{H}) - F(\tilde{H})$ is the excess generalization error incurred by $\hat{H}$
compared to the optimal solution $\tilde{H}$, and $E_2 = F(\tilde{H}) - F(H^*)$ is the
difference between the optimal error achievable by histograms in $C'$ and by
histograms in $C$. Below we bound both errors individually.

Bound on $E_1$: Recall that $\hat{H} = B\hat{W}B^T$. Also, note that
$\langle Q, BWB^T \rangle = \langle B^T Q B, W \rangle$ is the inner product function over
matrices. As $G_D(\cdot)$ is the regularized expected risk under a convex
risk function $f$, using Theorem 1, with probability $1 - \delta$:



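\[ E_{Q \sim D}[f(\langle Q, B\hat{W}B^T \rangle; s_Q)] - \lambda H(\tfrac{1}{M}\hat{W}) \le E_{Q \sim D}[f(\langle Q, B\tilde{W}B^T \rangle; s_Q)] - \lambda H(\tfrac{1}{M}\tilde{W}) + O\left(\frac{\Omega^2 L_f^2 M^2\, b \log\frac{1}{\delta}}{r^2 \Delta \lambda N}\right), \tag{47} \]

where:

• $L_f$ is the Lipschitz constant of the function $f(u; s_Q)$ w.r.t. $u$, over
a finite set. Hence, $L_f \le \max_{|u| \le \Omega \|W\|_2} \|\nabla f(u)\|_2$.

• $\Omega = \max_{Q \sim D} \|B^T Q B\|_F$. Assuming each query covers
only $|Q|$ attribute values, $\Omega \le |Q|$.

• $\lambda$ is a constant. Also, note that the entropy function $H(\tfrac{1}{M} W)$ is
$\tfrac{1}{M^2}$-strongly convex.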
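The identity $\langle Q, BWB^T \rangle = \langle B^T Q B, W \rangle$ used in this bound can be verified on a small instance. In the sketch below, `B` is a hypothetical 4 x 2 bucket-membership matrix, `W` holds illustrative bucket heights, and `Q` is the indicator matrix of a rectangular range query; all three are assumptions made for the example.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_inner(a, b):
    """Frobenius inner product <a, b> = sum of entrywise products."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Two equi-width buckets per axis over 4 attribute values per axis.
B = [[1, 0], [1, 0], [0, 1], [0, 1]]
W = [[2, 5], [1, 3]]
# Range query covering rows 1..2 and columns 0..2.
Q = [[1 if 1 <= i <= 2 and j <= 2 else 0 for j in range(4)] for i in range(4)]

H = matmul(matmul(B, W), transpose(B))        # full 4 x 4 histogram B W B^T
lhs = frob_inner(Q, H)                        # <Q, B W B^T>
rhs = frob_inner(matmul(matmul(transpose(B), Q), B), W)   # <B^T Q B, W>
print(lhs, rhs)
```

Both sides evaluate to 14 here, so a query can be answered against the small bucket matrix `W` without materializing the full histogram.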
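Now, $H(\tfrac{1}{M} W) \le \log b$. Hence, using (47),

\[ F(\hat{W}) - F(\tilde{W}) \le \lambda \log b + O\left(\frac{L_f^2 |Q|^2 M^2\, b \log\frac{1}{\delta}}{r^2 \Delta \lambda N}\right) \le O\left(\sqrt{\frac{b \log b \log\frac{1}{\delta}}{r^2 \Delta N}}\; L_f M |Q|\right), \tag{48} \]

where the second inequality follows by selecting

\[ \lambda = \sqrt{\frac{b \log\frac{1}{\delta}}{r^2 \log b\, \Delta N}}\; M |Q| L_f. \]

That is,

\[ E_1 = F(\hat{H}) - F(\tilde{H}) = F(\hat{W}) - F(\tilde{W}) \le O\left(\sqrt{\frac{b \log b \log\frac{1}{\delta}}{r^2 \Delta N}}\; L_f M |Q|\right). \tag{49} \]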



Bound on $E_2$: To bound $E_2$, we first observe that
$F(\tilde{H}) \le F(H) + \lambda \log b, \ \forall H \in C'$.

Hence,

\[ E_2 \le F(H) - F(H^*) + E_1, \quad \forall H \in C'. \tag{50} \]

Now, given any $H^*$ we construct a new matrix $\bar{H}^* \in C'$ for which
we can bound the error $F(\bar{H}^*) - F(H^*)$.

Note that along any axis, the number of 1-dimensional
buckets is at most $k$. Hence, using the error analysis from the 1-dimensional
case (see Equation (38)),

\H' -H*\\f < (51) 



Now, we bound F(h*) — F{h 



F{H')~F(H ) =EQ^T,[f{{Q,H');sQ) - f{{Q,H );sq)], 
<EQ^Ty[Lf\{Q,H')~{Q,H')\], 
<EQ^T>[Lf\\Q\\F\\H* -H*\\f], 
, M r 

<^/\^ir^, (52) 



where the second inequality follows using the Lipschitz property of $f$,
and the last inequality follows from the fact that $Q$ is over $|Q|$ attributes
only and using (51).

Now using (52) and (50):



E2<Lf^\^^+Ei. (53) 



Hence, by combining (45), (49), (53):

F{H)-F{H')<Lfy^\^^^ 



< M I L/ ( ^ 



A J \ N 
< Me, (54) 
where the last inequality follows by selecting $N$ using:

Now, the second inequality in (54) follows by selecting $b$ using:
where last inequality follows using k > ^ as A is the minimum 
bucket width. Hence proved. □ 



